In this post, we’ll start learning about **factors in R**. One of the most important uses of factors is in statistical modeling.

# What’s a factor in R and how to use it?

Before learning about factors, we should know that we can have two types of variables in R: categorical variables, which R stores as factors, and continuous variables.

A categorical variable, or factor, can belong to only a limited number of categories. A continuous variable, on the other hand, can take any of an infinite number of values.

Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly.
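To make this concrete, here is a minimal sketch using lm() from base R. The data frame people and its height_cm values are invented purely for illustration, and factor() itself is explained in the paragraphs that follow.

```r
# Hypothetical data, made up only for this example
people <- data.frame(
  height_cm = c(178, 165, 160, 182, 175),
  gender    = factor(c("Male", "Female", "Female", "Male", "Male"))
)

# Fit a simple linear model with the factor as the predictor
fit <- lm(height_cm ~ gender, data = people)

# Inspect the names of the estimated coefficients
names(coef(fit))
```

Under R’s default contrast settings, the coefficient names are "(Intercept)" and "genderMale": R expanded the factor into an indicator for “Male”, with “Female” (the first level) as the baseline, instead of treating gender as a number.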

We will use the function factor() to create a factor. First, let us generate a vector that contains observations belonging to a finite number of categories.

The function factor() will encode the created vector as a factor.

For example, let’s create a vector that contains the gender of 5 different individuals:

    gender_vector <- c("Male", "Female", "Female", "Male", "Male")

You can see that there are two factor levels: “Female” and “Male”.

Let’s generate a factor vector from this character vector containing the gender of these 5 individuals by applying factor().

    gender_vector <- c("Male", "Female", "Female", "Male", "Male")

    # Convert gender_vector to a factor
    factor_gender_vector <- factor(gender_vector)

    # Print factor_gender_vector
    factor_gender_vector
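It can help to look at what factor() actually stored. The sketch below rebuilds the same factor and inspects it with a few base R functions; it is only meant as a quick exploration, not a required step.

```r
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_gender_vector <- factor(gender_vector)

# The distinct categories, sorted alphabetically by default
levels(factor_gender_vector)      # "Female" "Male"

# How many categories there are
nlevels(factor_gender_vector)     # 2

# Internally, each element is an integer code pointing at a level
as.integer(factor_gender_vector)  # 2 1 1 2 2

# The object is no longer a character vector
class(factor_gender_vector)       # "factor"
```

So a factor is a vector of integer codes plus a set of level labels, which is why the order of the levels matters later on.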

We can classify categorical variables into two types.

## Nominal Variables

A nominal variable does not have an implicit order.

That means we cannot say that one is more or less than the other. For example,

- Gender (Male, Female, Transgender).
- Eye color (Blue, Green, Brown, Hazel).
- Type of pet (Dog, Cat, Fish, Bird).

Assume the categorical variable pets_vector with the categories “Dog”, “Cat”, “Fish” and “Bird”. Here, we cannot say that one is more or less than the other.
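As a small sketch of this, the pets example can be turned into a factor. The particular observations in pets_vector are made up for illustration.

```r
# Hypothetical observations for the pets example
pets_vector <- c("Dog", "Cat", "Fish", "Bird", "Cat")
factor_pets_vector <- factor(pets_vector)

# Levels are listed alphabetically; this is a display order, not a ranking
levels(factor_pets_vector)      # "Bird" "Cat" "Dog" "Fish"

# R does not treat this factor as ordered
is.ordered(factor_pets_vector)  # FALSE
```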

## Ordinal Variables

On the other hand, ordinal variables have a meaningful order. Usually, this order is clear from the values themselves, so the ranking is implicit.

Consider a variable like work experience (with 0 to 2 years as low, 2 to 6 as medium, and 6+ as high). Even though we can order these from lowest to highest, the spacing between the values may not be the same across the levels of the variable.

Another example is economic status, with the categories “Low”, “Medium” and “High”. It is implicitly understood that “High” stands above “Medium” and “Medium” stands above “Low”.
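One possible way to build such an ordinal variable from raw numbers is base R’s cut(). In the sketch below, the years_experience values are invented, the labels are illustrative, and the choice to put exactly 2 years in “low” and exactly 6 years in “medium” is an assumption, since the ranges in the text overlap at their boundaries.

```r
# Hypothetical years of work experience for five people
years_experience <- c(1, 4, 10, 0, 6)

# Bin the numbers into three ordered categories
experience_level <- cut(
  years_experience,
  breaks = c(0, 2, 6, Inf),             # bin edges: [0,2], (2,6], (6,Inf]
  labels = c("low", "medium", "high"),
  include.lowest = TRUE,                # keep 0 inside the first bin
  ordered_result = TRUE                 # return an ordered factor
)

experience_level
is.ordered(experience_level)
```

The result keeps the ranking low < medium < high, but it no longer records how far apart two people are in years, which is exactly the property of ordinal data described above.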

# How to change names of Factor levels?

Sometimes you might want to change the names of the factor levels to make a dataset easier to analyze. The function levels() in R allows you to change the names of factor levels.

Let’s create a gender_vector for our analysis:

    gender_vector <- c("F", "F", "M", "M", "M")
    factor_gender_vector <- factor(gender_vector)

If you type levels(factor_gender_vector), you’ll see that it outputs [1] “F” “M”.

Recording the gender as “F” and “M” can be confusing while analyzing the data.

If you don’t specify the factor levels while creating a factor, R will assign them alphabetically. To map “F” to “Female” and “M” to “Male”, the levels should be set to c("Female", "Male"), in this particular order.

Let’s modify the factor levels of factor_gender_vector by setting levels(factor_gender_vector) to c("Female", "Male"). The order of the elements matters: the first new name replaces the first existing level, the second replaces the second, and so on.
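The renaming step described above can be sketched as follows, reusing the same vector so the block runs on its own.

```r
gender_vector <- c("F", "F", "M", "M", "M")
factor_gender_vector <- factor(gender_vector)

# Levels before renaming, in alphabetical order
levels(factor_gender_vector)   # "F" "M"

# Rename the levels; position matters: "F" -> "Female", "M" -> "Male"
levels(factor_gender_vector) <- c("Female", "Male")

# Print the factor with its new labels
factor_gender_vector
```

Swapping the two names in the assignment would silently relabel every “F” as “Male”, so it is worth checking levels() before renaming.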

# summary() in R

summary() will give you a quick overview of the contents of a variable.

Suppose you would like to know how many “Female” entries you have in the data, and how many “Male” entries. We can use the summary() function to know the answer to these questions.

    summary(factor_gender_vector)

Take a look at the output. Because “Female” and “Male” are factor levels of factor_gender_vector, R can show the number of elements in each category. For the plain character vector gender_vector, summary() only reports its length, class and mode.

    > summary(gender_vector)
       Length     Class      Mode 
            5 character character 
    > summary(factor_gender_vector)
    Female   Male 
         2      3 
    >
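If you need the counts as values rather than printed output, a small sketch like the following may help. The names counts and is just a convenience variable introduced here; table() is a base R alternative that also works on the original character vector.

```r
gender_vector <- c("F", "F", "M", "M", "M")
factor_gender_vector <- factor(gender_vector)
levels(factor_gender_vector) <- c("Female", "Male")

# summary() on a factor returns a named vector of counts
counts <- summary(factor_gender_vector)
counts[["Female"]]   # 2
counts[["Male"]]     # 3

# table() counts categories even without converting to a factor first
table(gender_vector)
```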

# Compare factors in R

Let’s understand what happens when we compare factor levels of a nominal variable. We have a factor with two levels: “Female” and “Male” in factor_gender_vector.

Can we find out whether the factor level Male is greater than the factor level Female, or vice versa?

    # The second element is "Female", the third is "Male"
    Female <- factor_gender_vector[2]
    Male <- factor_gender_vector[3]

    # Try to compare the two elements
    Female > Male

R returns NA, together with a warning, when you try to compare factor levels of a nominal variable, since there is no implicit order.

Let’s now discuss ordered factors, where more meaningful comparisons are possible because the factor levels have an ordering.
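The sketch below reproduces this behavior in a self-contained way. It uses lowercase names female and male and wraps the comparison in suppressWarnings() only so that the result can be captured quietly; both choices are for illustration. It also shows that equality tests remain valid for unordered factors.

```r
gender_vector <- c("F", "F", "M", "M", "M")
factor_gender_vector <- factor(gender_vector)
levels(factor_gender_vector) <- c("Female", "Male")

female <- factor_gender_vector[2]
male   <- factor_gender_vector[3]

# Ordering comparisons are not defined for unordered factors:
# R warns and returns NA (the warning is silenced here on purpose)
result <- suppressWarnings(female > male)
is.na(result)    # TRUE

# Equality and inequality still work
female == male   # FALSE
female != male   # TRUE
```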

# Ordered factors in R

Occasionally you’ll also deal with factors whose levels do have a natural ordering. When that is the case, we need to make sure we pass this information to R.

Because “Male” and “Female” are unordered (nominal) factor levels, R issues a warning telling you that the greater-than operator is not meaningful for factors. As seen earlier, R gives such levels no rank relative to one another.

Suppose you’re leading a group of five software developers and you would like to appraise their performance.

To achieve this, you monitor their speed, assess each programmer as “bad”, “average” or “good” and save the results in performance_vector.

Each entry should be one of “bad”, “average”, or “good”. Consider the vector below:

    performance_vector <- c("average", "bad", "bad", "average", "good")

performance_vector ought to be converted to an ordinal factor because its categories have a natural ordering.

By default, the function factor() transforms performance_vector into an unordered factor. To make an ordered factor, you have to add two additional arguments: **ordered** and **levels**.

Along with the **ordered** argument, we need to set the argument **levels**. Using **levels**, you provide the values of the factor in the right order, from lowest to highest.

From performance_vector, make an ordered factor vector: factor_performance_vector. Set ordered to TRUE, and set levels to c("bad", "average", "good").

    performance_vector <- c("average", "bad", "bad", "average", "good")
    factor_performance_vector <- factor(performance_vector,
                                        ordered = TRUE,
                                        levels = c("bad", "average", "good"))
    # Pick the second and fifth developers and compare them
    sde2 <- factor_performance_vector[2]
    sde5 <- factor_performance_vector[5]
    sde2 > sde5

Because factor_performance_vector is now ordered, we can compare individual elements (software developers, in this instance) using the familiar comparison operators.

The last line of the code above checks whether sde2 is rated better than sde5. It returns FALSE, because the second developer’s rating sits below the fifth developer’s in the level order.
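Ordered factors support more than a single pairwise comparison. The following sketch, which rebuilds the factor with the “bad”, “average”, “good” levels described in the text, shows a few other base R operations that rely on the level order.

```r
performance_vector <- c("average", "bad", "bad", "average", "good")
factor_performance_vector <- factor(performance_vector,
                                    ordered = TRUE,
                                    levels = c("bad", "average", "good"))

# Compare developer 2 with developer 5
factor_performance_vector[2] > factor_performance_vector[5]   # FALSE

# Compare every developer against a level name
factor_performance_vector >= "average"   # TRUE FALSE FALSE TRUE TRUE

# The highest rating present in the data
max(factor_performance_vector)           # good
```

When printed, an ordered factor shows its levels as bad < average < good, which is a quick way to confirm that R has the ranking you intended.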

That’s all for now on factors in R.