How to read data using pandas read_csv

In this post, we will learn about the pandas read_csv function.
It is used to read a csv(comma separated values) file  and convert to pandas dataframe.

pandas is a very important library used in data science projects using python.
Lets convert a csv file containing data about Fortune 500 companies into pandas dataframe.

pandas read_csv example

pandas read_csv tutorial

Lets now try to understand what are the different parameters of pandas read_csv and how to use them.

sep

If the separator between each field of your data is not a comma, use the sep argument.For example, we want to change these pipe separated values to a dataframe using pandas read_csv separator.

pandas read_csv -data with | separator

Use sep = “|”  as shown below

pandas read_csv - dataframe from pipe delimited data

delimiter

The delimiter argument of pandas read_csv function is same as sep. A question may arise , if both sep and delimiter are same, then why do have two arguments. I think it is for backward compatibility.

sep is more commonly used than delimiter.

header

Use pandas read_csv header to specify which line in your data is to be considered as header.For example, the header is already present in the first line of our dataset shown below(note the bolded line).

csv file to understand pandas read_csv

In this case, we need to either use header = 0 or don’t use any header argument.

pandas read_csv header = 0 example

header = 1 means consider second line of the dataset as header.

pandas read_csv header = 1 example

Load csv with no header using pandas read_csv

If your csv file does not have header, then you need to set header = None while reading it .Then pandas will use auto generated integer values as header.

sample csv file without header

pandas read_csv no header example

 

names

Use the names attribute if you would want to specify column names to the dataframe explicitly. All the column names should be mentioned within a list.

Refer to the example below where I am mentioning the column names as COLUMN1, COLUMN2, ..till COLUMN18. Notice how the header of the dataframe changes  from earlier header.

pandas read_csv assigning names to dataframe columns

index_col

Use this argument to specify the row labels to use. If you set index_col to 0, then the first column of the dataframe will become the row label. Notice how the row labels change.

Earlier the row labels were 0,1,2,…etc. Now, the row labels have changed to Walmart, State Grid etc.

pandas read_csv assigning row labels using index_col

Note that you can specify more than one column index (or) a combination of multiple column names as argument for index_col .

In such a case, you need to enclose all of these column indexes or column names in a list.

So I can either have [0,1] (or) [“company”, “rank”] as index_col.

See the impact of this change on the dataframe in the example below. Now the values of both “company” and “rank” constitute the row label.

 

pandas read_csv assigning row labels using multiple column names or index index_col example 2

pandas read_csv usecols

Use pandas usecols when you want to load specific columns into dataframe. When your input dataset contains a large number of columns, and you want to load a subset of those columns into a dataframe , then usecols will be very useful.

Performance wise, it is  better because instead of loading an entire dataframe into memory and then deleting the not required columns, we can select the columns that we’ll need, while loading the dataset itself.

As a parameter to usecols , you can pass either a list of strings corresponding to the column names or a list of integers corresponding to column index . Refer to below example where I am passing a list of strings as usecols parameter.

pandas read_csv usecols

You can also use column index positions as parameter to usecols

passing list of column indexes as parameter to pandas read_csv usecols

Note that element order is ignored while using usecols. So , if you pass [2,1,0] or [0,1,2] as parameter to usecols, the resulting dataframe columns will have same column order , namely company, rank, revenues !

element order is ignored while using usecols

example for read_csv usecols ignoring element order while parsing

This behaviour holds true when passing list of column names as well. So, if you pass [“company”, “rank”, “revenues”] or [“rank, “company”, “revenues”] to pandas usecols, the resulting dataframe will have same order of columns , namely “company”, “rank” “revenues” !

If you want column order to be enforced while using usecols , then you need to pass the list containing column names explicitly – see below.

example showing column order being mentioned while using pandas read_csv usecols

Using pandas read_csv  to skip columns while reading

One more use of the usecols parameter is to skip certain columns in your dataframe. See an example below.I am using a callable as a  usecols  parameter in order to exclude the columns  – company, rank, and revenues, and retain all the other columns. Notice the change in the columns present in the dataframe , after using usecols.

using pandas read_csv to skip columns while reading into dataframe

squeeze

If your dataset contains only one column, and you want to return a Series from it , set the squeeze option to True.

I created a file containing only one column, and read it using pandas read_csv by setting squeeze = True.We will get a pandas Series object as output, instead of pandas Dataframe.

dataset with one column only

pandas read_csv squeeze

prefix

When a data set doesn’t have any header , and you try to convert it to dataframe by (header = None), pandas read_csv generates dataframe column names automatically with integer values 0,1,2,…

If we want to prefix each column’s name with a string, say, “COLUMN”, such that dataframe column names will now become COLUMN0, COLUMN1, COLUMN2 etc. we use prefix argument.

pandas read_csv prefix

mangle_dupe_cols

If a dataset has duplicate column names, convert it to a dataframe by setting mangle_dupe_cols to True.

Refer to example below. The revenues column appears twice.

dataset with duplicate column names

In the dataframe, the second revenues column will be named as revenues.1

pandas dataframe with duplicate column names

pandas read_csv dtype

Use  dtype to set the datatype for the data or dataframe columns. If you want to set data type for mutiple columns, separate them with a comma within the dtype parameter, like {‘col1’ : “float64”, “col2”: “Int64”}

In the below example, I am setting data type of “revenues” column to float64.

pandas read_csv dtype

engine

Used to specify the parsing engine. Can be  python or  c based.

c engine is faster but python based parser is more feature rich.

converters

Use converters to convert values in certain columns, using a dict of functions.

In the below example, I have declared a function f which replaces decimal point to a comma. I am using converters to call the function f on profits column.

Upon doing this, the decimal point in profits column gets changed to comma(,)

pandas read_csv converter

true_values , false_values

Suppose your dataset contains Yes and No string which you want to interpret as True and False.

pandas read_csv true_values false_values

we can tell Pandas to convert ‘Yes’ strings to True and ‘No’ string to False using true_values and false_values

pandas read_csv true_values false_values

skipinitialspace

If you have leading or trailing spaces in a field, then pandas read_csv may not work as expected.

Note the below example where you can see two spaces in first two rows of col2.

Use skipinitialspace in this scenario to interpret ‘No’ as False. If skipinitialspace is not set to True, then col2 will still have No instead of False.

skiprows

You can use pandas read_csv skip rows to

  • Exclude reading specified number of rows from the beginning of a csv file , by passing an integer argument (or)
  • Skip reading specific row indices from a csv file, by passing a list containing row indices to skip.

Lets use the below dataset to understand skiprows

dataset to demonstrate skiprows argument of pandas read_csv

To skip reading the first 4 rows from this csv file, you can use skiprows = 4

read_csv skip rows by specifying number of rows to exclude from beginning of file

To skip reading rows with indices 2 to 4 , you can use

mentioning row positions (or) index as parameter to skiprows

skipfooter

Indicates number of rows to skip from bottom of the file.

Ensure that engine parameter is ‘python’ and not ‘c’ while using skipfooter.

So to skip last 4 rows of a file, you can use skipfooter = 4

nrows

If you want to read a limited number of rows, instead of all the rows in a dataset, use nrows. This is especially useful when reading a large file into a pandas dataframe.

pandas read_csv nrows

pandas read_csv default NaN values

Please note that all these strings are considered as default NaN values by pandas read_csv function – ”, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

In the below example, col1 has spaces and col2  has #N/A. Observe how both these columns are interpreted as NaN when read using pandas read_csv.

keep_default_na

When parsing data, you can choose to include or not the default NaN values.

For example, you don’t want to consider ” and ‘#N/A’ as NaN, then you need to set keep_default_na to False.

na_values

If you want any additional strings to be considered as NaN, other than the default NaN values, then use na_values.

In the below example, I have added the list of strings [“Walmart”, “Energy”]  as an attribute for na_values, so that both these values will be replaced by NaN.

pandas read_csv na_values

In the below example, I want to interpret the string “Energy” within the column “sector” to be interpreted as NaN. Hence I included them in a dictionary attribute of na_values

pandas read_csv – should i use both keep_default_na and na_values ?

Often in data science projects, you might get a scenario where you don’t want to consider all of the default NaN values while parsing.

I mean, suppose you do not want to consider ”  as NaN, but want to consider  ‘#N/A’ as NaN.

Then you need to set keep_default_na to False, and set na_values to ‘#N/A’.

pandas read_csv keep_default_na

Please note the below important points,

If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

Source : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

na_filter

It is used to find and interpret missing values in your dataset.

If you have a large file, and you are sure that data does not contain any missing values, set na_filter as False. This will boost the performance of data parsing.

verbose

Using the verbose = True will print additional information.

The time taken for each stage of converting the file into a dataframe, like tokenization, type conversion and memory clean up, will be printed.

skip_blank_lines

If skip_blank_lines option is set to False, then wherever blank lines are present, NaN values will be inserted into the dataframe.

If set to True, then the entire blank line will be skipped.

dataset with blank lines

pandas read_csv skip_blank_lines

parse_dates

We can use pandas parse_dates to parse columns as datetime. You can either use parse_dates = True or parse_dates = [‘column name’]

Let’s convert col3, that has the string content, to a datetime datatype. If you don’t use parse_dates in the read_csv call, col3 will be represented as an object.

data containing date column - col3

Let’s use parse_dates argument of pandas read_csv and set it to the list of column names that we intend to parse as a datetime. See the dtypes after setting parse_dates to col3. It has been read in as datetime64[ns] now !!

pandas read_csv parse_dates

If pandas is unable to convert a particular column to datetime, even after using parse_dates, it will return the object data type.

infer_datetime_format

If you  set infer_datetime_format to True and enable parse_dates for a column , pandas read_csv will try to parse the data type of that column into datetime quickly .

The advantage of infer_datetime_format is that the parsing happens very quickly.

If it is successful in inferring the datetime format of the string,  then parsing speed will be increased by 10 times !

keep_date_col

Often in data science, you will work with date columns for data analysis.

To understand keep_date_col , let us consider that you want to combine the three columns,  day , month and year and derive a new date column, called date_col

date columns parsing in pandas read_csv

Note that  by defaut, pandas read_csv will just retain the new date column, and drop the columns from which you derived the new date column.

Hence , you can see only date_col in the dataframe’s dtypes

If you set keep_date_col to True , the original date columns, namely day , month and year will be retained , along with the new date column date_col in the pandas dataframe.

pandas read_csv keep_date_col

date_parser

Using the date_parser is the most flexible way to parse a file containing date columns

Suppose you have a column called date_col in your dataset which has date in the format YYYY DD MM HH:MM:SS , like shown below

data to be parsed containing date columns

The easiest way is to write a lambda function which can read the data in this format ,  and pass this lambda function to the date_parser argument .

Observe now , that the date_col has been populated correctly in pandas dataframe.

using pandas read_csv date_parser

dayfirst

Use the dayfirst parameter to indicate that day comes first in your column representing dates.

Consider the below data .

The first row contains “04/10/96” , and pandas considers 04 as the month .

What if, you want pandas to consider 04 as the day instead of month?

how pandas read_csv understands data containing dates by default

Using dayfirst in such a scenario to indicate that the day part comes first in your data

Notice the change in the interpretation by setting dayfirst to True. pandas has interpreted the same value as 4th of October ‘ 1996 , instead of 10th of April ‘ 1996 !

pandas read_csv dayfirst

iterator

Using the iterator parameter , you can define how much data you want to read , in each iteration.By setting iterator to True , the pandas dataframe object will become a TextFileReader object .

You can use df.get_chunk(n) to retrieve the rows from this object. Each execution of get_chunk will retrieve n number of rows from the last retrieved row.

pandas read_csv iterator

pandas read_csv chunksize

By specifying a chunksize , you can retrieve the data in same sized ‘chunks’ .

This is especially useful when reading a huge dataset as part of your data science project.

Instead of putting the entire dataset into memory , this is a ‘lazy’ way to read equal sized portions of the data. Setting chunksize will return a TextFileReader object.

See an example below , i have specified a chunk size of 2 here. So two rows will be returned for each for loop execution.

pandas read_csv chunksize

compression

pandas read_csv has the ability to read compressed files.

By default ,  compression parameter is set to ‘infer’ , which means it will try to decompress the files of the type (gzip , zip , bz2  , xz )  and read them into dataframe.

If you want to do analysis on a huge file , it is always better to use compressed file.

Don’t uncompress the file and try to read to dataframe.

I mean, if you have a gz file , then set compression = ‘gzip’ and work with gz file directly.

thousands

If a column within your dataset contains a comma to indicate the thousands place, and you try to convert this dataset to a dataframe using pandas read_csv ,  then this column would be considered as a string !

Refer to example below.

dataset with column containing a comma

The revenues column contains a comma , and when you try to convert to dataframe, the dtype for revenues becomes an object

check the dtypes without the thousands parameter

To ensure that such a column is interpreted as int64 , you need to indicate to pandas read_csv that comma is a thousands place indicator , like shown below.

Observe that the dtype of revenues column has now changed to int64 instead of object .

usage of pandas read_csv thousands parameter

decimal

Used to indicate which character is to be considered as decimal point indicator.

In some countries within Europe , a comma may be considered as decimal point indicator.We need to set decimal = “,” to parse such data .

lineterminator

Suppose we have data like shown below.You want to convert this data to a pandas dataframe by considering ~ as line terminator

. Then specify lineterminator = ‘~’ within the call to read_csv

pandas read_csv lineterminator

References:

Stack Overflow

Python pandas documentation

Recent Posts

Menu