In this post, we’ll start learning about **factors in R**. One of the most important uses of factors is in statistical modeling.

# What’s a factor in R and how to use it?

Before learning about factors, we should know that we can have two types of variables in R: categorical variables (stored as factors) and continuous variables.

A categorical variable, or factor, can take only a limited number of values, called categories. A continuous variable, on the other hand, can take any of an infinite number of values.

A good example of a categorical variable is sex. In many circumstances, you can limit the sex categories to “Male” or “Female”.

Since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly.

To create factors in R, you use the function factor(). The first thing you have to do is create a vector that contains all the observations, each belonging to one of a limited number of categories.

The function factor() will encode the created vector as a factor.

For example, let’s create a sex_vector that contains the sex of 5 different individuals:

    sex_vector <- c("Male", "Female", "Female", "Male", "Male")

In R terms, there are two ‘factor levels’ at work here: “Male” and “Female”.

Let’s convert the character vector sex_vector to a factor with factor() and assign the result to factor_sex_vector.

    sex_vector <- c("Male", "Female", "Female", "Male", "Male")

    # Convert sex_vector to a factor
    factor_sex_vector <- factor(sex_vector)

    # Print out factor_sex_vector
    factor_sex_vector
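To make the word "encode" more concrete, here is a small sketch that inspects the factor created above. It uses only base R functions (levels(), nlevels(), as.integer()); the comments show the values these calls return for this particular vector.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# The distinct categories, sorted alphabetically by default
levels(factor_sex_vector)      # "Female" "Male"

# How many levels the factor has
nlevels(factor_sex_vector)     # 2

# The integer codes stored for each observation:
# each code is the position of that observation's level in levels()
as.integer(factor_sex_vector)  # 2 1 1 2 2
```

In other words, a factor stores a compact integer code per observation plus a lookup table of level names.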

There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other.

In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.
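As a rough illustration of the difference, the sketch below builds one nominal and one ordinal factor. The category names come from the text above, but the individual observations in temperature_vector are made up for this example. The ordered and levels arguments are explained in the section on ordered factors.

```r
# Nominal: no order among the categories
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

# Ordinal: levels are listed from lowest to highest
# (these five observations are invented for illustration)
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
factor_temperature_vector

# Check which of the two factors carries an ordering
is.ordered(factor_animals_vector)      # FALSE
is.ordered(factor_temperature_vector)  # TRUE
```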

# How to change names of Factor levels?

When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels().

Let’s create a survey_vector for our analysis:

    survey_vector <- c("M", "F", "F", "M", "M")
    factor_survey_vector <- factor(survey_vector)

If you type levels(factor_survey_vector), you’ll see that it outputs [1] “F” “M”.

Recording the sex with the abbreviations “F” and “M” can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data.

If you don’t specify the levels when creating the factor, R takes the unique values of the vector and sorts them alphabetically. To correctly map “F” to “Female” and “M” to “Male”, the levels should therefore be set to c(“Female”, “Male”), in this order.

Let’s change the factor levels of factor_survey_vector, by setting levels(factor_survey_vector) to c(“Female”, “Male”). Mind the order of the vector elements here.
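The renaming step described above could look like the following sketch. It recreates the factor first so that it runs on its own.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

# Levels before renaming, in alphabetical order
levels(factor_survey_vector)   # "F" "M"

# Rename the levels; the order must match the existing levels ("F", then "M")
levels(factor_survey_vector) <- c("Female", "Male")

# Every "F" is now shown as "Female" and every "M" as "Male"
factor_survey_vector
```

Writing c("Male", "Female") here instead would silently swap the labels, which is why the order matters.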

# summary() in R

summary() will give you a quick overview of the contents of a variable.

Suppose you would like to know how many “Female” responses you have in your study, and how many “Male” responses. The summary() function gives you the answer to this question.

    # Generate summary for factor_survey_vector
    summary(factor_survey_vector)

Have a look at the output. The fact that you identified “Male” and “Female” as factor levels in factor_survey_vector enables R to show the number of elements for each category.
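To see what the factor adds, it can help to compare the summary of the original character vector with the summary of the factor. The sketch below assumes the levels have been renamed to “Female” and “Male” as described earlier.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Character vector: only length, class and mode are reported
summary(survey_vector)

# Factor: one count per level
counts <- summary(factor_survey_vector)
counts   # Female: 2, Male: 3
```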

# Compare factors in R

What happens when you try to compare elements of a factor? In factor_survey_vector you have a factor with two levels: “Male” and “Female”. But how does R value these relative to each other?

    male <- factor_survey_vector[1]
    female <- factor_survey_vector[2]
    male > female

By default, R returns NA, together with a warning, when you try to compare values of an unordered factor with operators such as > or <, since the idea doesn’t make sense.
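The sketch below reproduces this behaviour in a self-contained way. suppressWarnings() is used only to keep the output quiet; without it, R also prints a warning that the operator is not meaningful for factors. Equality tests, in contrast, work fine on unordered factors.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Ordering comparison on an unordered factor gives NA
result <- suppressWarnings(male > female)
result            # NA

# Equality comparisons are still meaningful
male == female    # FALSE
male == "Male"    # TRUE
```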

Let’s now learn about ordered factors, where more meaningful comparisons are possible.

# Ordered factors in R

Sometimes you will also deal with factors that do have a natural ordering between their categories. If this is the case, we have to make sure that we pass this information to R.

In the comparison above, “Male” and “Female” are unordered (or nominal) factor levels, so R returns a warning message telling you that the greater-than operator is not meaningful. R attaches an equal value to the levels of such factors.

Suppose you are leading a team of five software developers and you want to evaluate their performance.

To do this, you track their speed, evaluate each developer as “bad”, “average” or “good”, and save the results in performance_vector.

Each entry should be either “bad”, “average”, or “good”. Use the list below:

    performance_vector <- c("average", "bad", "bad", "average", "good")

performance_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms performance_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.

By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered.

With the argument levels you give the values of the factor in the correct order.

From performance_vector, create an ordered factor called factor_performance_vector. Set ordered to TRUE, and set levels to c(“bad”, “average”, “good”).

The fact that factor_performance_vector is now ordered enables us to compare different elements (software developers in this case). You can simply do this with the familiar comparison operators.

Let’s check if the second developer (sde2) is better than the fifth (sde5).

    performance_vector <- c("average", "bad", "bad", "average", "good")
    factor_performance_vector <- factor(performance_vector,
                                        ordered = TRUE,
                                        levels = c("bad", "average", "good"))

    # Is developer 2 rated higher than developer 5?
    sde2 <- factor_performance_vector[2]
    sde5 <- factor_performance_vector[5]
    sde2 > sde5
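Once a factor is ordered, a few more operations become available. The sketch below uses the “bad”, “average” and “good” ratings from the text above and shows a comparison against a level name, along with min() and max(). It is meant as an illustration, not an exhaustive list.

```r
performance_vector <- c("average", "bad", "bad", "average", "good")
factor_performance_vector <- factor(performance_vector,
                                    ordered = TRUE,
                                    levels = c("bad", "average", "good"))

# Printing an ordered factor shows the ordering: bad < average < good
factor_performance_vector

# Which developers are rated at least "average"?
at_least_average <- factor_performance_vector >= "average"
at_least_average                 # TRUE FALSE FALSE TRUE TRUE

# Lowest and highest rating observed
min(factor_performance_vector)   # bad
max(factor_performance_vector)   # good
```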