What is data science really? Data science is creating new ways to analyze data through algorithms, machine learning, and statistics. That sounds like the super sexy stuff, right? Unfortunately, for every machine learning algorithm you apply, the data first needs to be preprocessed and formatted in a way that the algorithm can recognize. Sometimes you have a data analyst to perform these tedious operations for you, but most of the time, that isn't the case.
Pandas is a library that does a great job of handling this not-so-sexy side of data science. It deals fantastically with tabular data (such as Excel sheets or relational database tables). If you are free from data preprocessing, then consider yourself lucky; we'll come and find you inside the black box later. But for the rest of us, let's look at the top five functions within the pandas library that will help us in these operations.
Pandas gives us a wide selection of methods for importing different file types into our Python script and writing them back out.
This is super handy, because sometimes we may already have in our possession a really well-formatted Excel file with all the data we need. We can read this Excel sheet directly into pandas using the read_excel() method.
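To make the import step concrete, here is a minimal sketch of a loader that picks a pandas reader based on the file extension. The helper name load_table, the sales.csv file, and its columns are made up for illustration. Note that read_excel() needs an Excel engine such as openpyxl installed, so the demo below writes and reads a CSV file to stay runnable anywhere.

```python
import os
import tempfile

import pandas as pd


def load_table(path):
    """Hypothetical helper: read a tabular file, choosing the reader by extension."""
    if path.lower().endswith((".xlsx", ".xls")):
        # read_excel requires an Excel engine (for example openpyxl) to be installed
        return pd.read_excel(path, sheet_name=0)
    return pd.read_csv(path)


# Build a small throwaway file so the sketch is self-contained (made-up data)
sales = pd.DataFrame({"region": ["north", "south"], "units": [12, 7]})
csv_path = os.path.join(tempfile.mkdtemp(), "sales.csv")
sales.to_csv(csv_path, index=False)  # index=False keeps the row labels out of the file

loaded = load_table(csv_path)
print(loaded)
```

Pointing load_table at an .xlsx path would go through read_excel() instead, assuming an engine is available.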
We can draw from many different sources and compile our data together in one location, which is a must in preprocessing.
The pandas concat() function has many different parameters that we can use to join our data together exactly the way we want. This is helpful whenever we need to omit certain fields or emphasize others.
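As a rough illustration of those parameters, the sketch below stacks two small frames and uses join="inner" to leave out a field that only one of them has. The frame names and columns are invented for the example.

```python
import pandas as pd

# Two made-up monthly tables; only February has a "coupon" column
jan = pd.DataFrame({"customer": ["ann", "bob"], "spend": [10.0, 20.0]})
feb = pd.DataFrame({"customer": ["cat"], "spend": [15.0], "coupon": ["SPRING"]})

# Default join="outer" keeps every column and fills the gaps with NaN;
# ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([jan, feb], ignore_index=True)

# join="inner" keeps only the columns present in every frame, omitting "coupon"
shared_only = pd.concat([jan, feb], ignore_index=True, join="inner")

print(stacked)
print(shared_only)
```

Passing axis=1 instead would place the frames side by side, aligned on their index, rather than stacking rows.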
The difference between the loc and iloc methods is simply what you use to select data in your DataFrame. loc takes labels as its argument, such as the header strings of your DataFrame. iloc takes the integer positions of the fields you want to index.
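Here is a small sketch of the same selection written both ways, using a made-up DataFrame with string row labels so the contrast is visible.

```python
import pandas as pd

# Made-up data with string row labels
df = pd.DataFrame(
    {"height": [150, 160, 170], "weight": [50, 60, 70], "age": [30, 40, 50]},
    index=["a", "b", "c"],
)

# loc: select by row labels and column headers
by_label = df.loc[["a", "b"], ["height", "age"]]

# iloc: select the same cells by integer position
by_position = df.iloc[0:2, [0, 2]]

# Label slices include the end label; positional slices exclude the end position
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:2]

print(by_label.equals(by_position))
```

One detail worth remembering: a loc slice such as "a":"b" includes row "b", while the iloc slice 0:2 stops before position 2, so both slices above return the same two rows.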
Being able to select subsets of your data is a must, because as data scientists we can run into overfitting when our data has too many related fields. Pulling these fields out is crucial to the entire process of analyzing data effectively.
Pandas is undoubtedly powerful. If you're new and just getting started in the field, I have plenty of tutorials on my YouTube channel.
There I cover topics on how we can start getting away from Excel for preprocessing, automate our processes, and more. The final two pandas methods that are super helpful are revealed in my video below.