How to read multiple data files into pandas

Let’s check out how to read multiple files into a collection of data frames.

Tools for pandas data import

The primary tool we can use for data import is read_csv. This function accepts the file path of a comma-separated values (CSV) file as input and returns a pandas data frame directly. read_csv has about 50 optional parameters that permit very fine-tuned data import.
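As a rough illustration of that fine-tuning, the sketch below overrides a few of read_csv's optional parameters. The CSV text is made up for this example and is read from an in-memory buffer instead of a file path, so the snippet runs without any file on disk.

```python
import io

import pandas as pd

# Made-up CSV content held in memory (an assumption for illustration only).
csv_text = """Date;Product;Units
2015-01-05;Hardware;12
2015-01-09;Software;7
"""

# The defaults assume comma separators; here a few optional parameters are overridden.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",               # field delimiter (the default is ",")
    parse_dates=["Date"],  # convert this column to datetimes
    index_col="Date",      # use it as the row index
)

print(df)
```

With a real file, the first argument would simply be its path.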

pandas has other convenient tools with similar default calling syntax that import various data formats into data frames, such as read_excel, read_json, and read_html.
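For instance, read_json can be called in much the same way as read_csv. The following is only a small sketch: the JSON records are invented for illustration and are passed through an in-memory buffer instead of a file path.

```python
import io

import pandas as pd

# Invented JSON records (an assumption for illustration only).
json_text = '[{"Product": "Hardware", "Units": 12}, {"Product": "Software", "Units": 7}]'

# Like read_csv, read_json takes a path or file-like object and returns a data frame.
df = pd.read_json(io.StringIO(json_text))

print(df)
```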

Loading separate files

To read multiple files using pandas, we generally need separate data frames. For example, we can call pd.read_csv twice to read two CSV files, sales-jan-2015.csv and sales-feb-2015.csv, into two distinct data frames.
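A minimal sketch of the two separate calls might look like this. To keep it runnable, it first writes two tiny placeholder files into a temporary directory; their contents and the variable names dataframe0 and dataframe1 are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two tiny placeholder files; the contents are invented for this sketch.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "sales-jan-2015.csv").write_text("Product,Units\nHardware,12\nSoftware,7\n")
(data_dir / "sales-feb-2015.csv").write_text("Product,Units\nHardware,9\nService,4\n")

# One read_csv call per file, each producing its own data frame.
dataframe0 = pd.read_csv(data_dir / "sales-jan-2015.csv")
dataframe1 = pd.read_csv(data_dir / "sales-feb-2015.csv")

print(dataframe0)
print(dataframe1)
```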

Using a loop

It’s generally less repetitive to iterate over a collection of file names. With that goal, we can create a list called filenames with the two file paths from before. We then initialize an empty list called dataframes and iterate through the list of filenames. Within each iteration we invoke read_csv to read a data frame from a file, and we append the resulting data frame to the dataframes list.
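The loop described above can be sketched as follows. The temporary directory and the file contents are placeholders invented so the snippet runs on its own; the loop variable f is likewise an arbitrary choice.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder files with invented contents, so the sketch is self-contained.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "sales-jan-2015.csv").write_text("Product,Units\nHardware,12\nSoftware,7\n")
(data_dir / "sales-feb-2015.csv").write_text("Product,Units\nHardware,9\nService,4\n")

filenames = [data_dir / "sales-jan-2015.csv", data_dir / "sales-feb-2015.csv"]

# Start with an empty list and append one data frame per file.
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

print(len(dataframes))
```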

Using a comprehension

We can also do the preceding computation with a list comprehension. Comprehensions are a convenient Python construct for exactly this kind of loop, in which a list starts out empty and is appended to within each iteration.
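A sketch of the comprehension version, under the same assumptions as before (placeholder files with invented contents in a temporary directory):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder files with invented contents, so the sketch is self-contained.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "sales-jan-2015.csv").write_text("Product,Units\nHardware,12\nSoftware,7\n")
(data_dir / "sales-feb-2015.csv").write_text("Product,Units\nHardware,9\nService,4\n")

filenames = [data_dir / "sales-jan-2015.csv", data_dir / "sales-feb-2015.csv"]

# The comprehension builds the same list as the explicit loop, in one expression.
dataframes = [pd.read_csv(f) for f in filenames]

print(len(dataframes))
```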

Using glob

When many file names have a similar pattern, the glob module from the Python Standard Library is very useful.

We start by importing the function glob from the standard-library glob module. We use the pattern sales*.csv to match any file name that starts with the prefix sales and ends with the suffix .csv. The asterisk is a wildcard that matches zero or more characters.

The function glob uses the wildcard pattern to create a list called filenames containing all matching file names in the current directory. Finally, filenames is consumed in a list comprehension that makes a list called dataframes containing the relevant data frames.
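The glob approach might be sketched as below. Two details are assumptions made for this example and are not part of the description above: the pattern is joined to a temporary directory instead of relying on the current directory, and the result is passed through sorted, because glob does not guarantee any particular order.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Placeholder files with invented contents; notes.txt should not match the pattern.
data_dir = tempfile.mkdtemp()
samples = {
    "sales-jan-2015.csv": "Product,Units\nHardware,12\nSoftware,7\n",
    "sales-feb-2015.csv": "Product,Units\nHardware,9\nService,4\n",
    "notes.txt": "not a sales file\n",
}
for name, text in samples.items():
    with open(os.path.join(data_dir, name), "w") as fh:
        fh.write(text)

# Match every file whose name starts with "sales" and ends with ".csv".
filenames = sorted(glob(os.path.join(data_dir, "sales*.csv")))

# Read each matching file into its own data frame.
dataframes = [pd.read_csv(f) for f in filenames]

print([os.path.basename(f) for f in filenames])
```

If the files sit in the current working directory, the pattern can be passed on its own, as in glob("sales*.csv").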

