Pandas

Pandas is a Python library used to analyze data and work with data sets. The name is derived from both "Panel Data" and "Python Data Analysis". [1]

Pandas is used to clean up bad data sets, find correlations, and analyze data using two core structures: DataFrames and Series.

To install the Pandas library for your version of Python, run the following command from the command line:

    pip install pandas

To import the Pandas library into your scripts, use the following code to have access to it as a reference:
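The widely used convention, assumed here, is to import the library under the alias pd:

```python
# Import Pandas under its conventional alias.
import pandas as pd

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```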

Pandas provides a data structure called a Series. A Series is a one-dimensional array that acts like a column of a database table, and it can hold any data type. A single "cell" of data can be accessed either by its numerical index or by a label that was created for that index.
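As a minimal sketch of both access styles, the example below builds one Series with the default numerical index and one with labels. The values and label names are made up for illustration.

```python
import pandas as pd

# A Series built from a plain list gets a default numerical index (0, 1, 2).
temperatures = pd.Series([21.5, 19.0, 23.2])
first = temperatures[0]            # access by numerical index

# Supplying labels lets you address each cell by name instead.
labeled = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])
tuesday = labeled["tue"]           # access by label
by_position = labeled.iloc[2]      # positional access still works via .iloc

print(first, tuesday, by_position)
```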

Another core concept in Pandas is the DataFrame. A DataFrame acts like a table and is closely related to the Series: each column of a DataFrame is a Series, much like the column-table relationship in a relational database. DataFrames are two-dimensional. Large data sets can be stored easily inside a DataFrame, and the library supports direct loading from common file formats such as JSON and CSV. The info() method prints information about the dataset, such as the total number of rows and columns, the number of non-null values in each column, and the data type of each column, which is useful for judging whether the dataset needs more cleanup.
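The sketch below loads a small CSV into a DataFrame and calls info() on it. The CSV content is hypothetical and is read from an in-memory string so the example is self-contained; with a real file you would pass its path to read_csv (read_json works similarly for JSON).

```python
import io

import pandas as pd

# Hypothetical CSV content with two deliberately empty cells.
csv_text = """name,age,city
Ana,34,Lisbon
Ben,,Toronto
Chloe,29,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Each column of a DataFrame is itself a Series.
ages = df["age"]

# info() prints the row count, column names, non-null counts, and dtypes.
df.info()
```

In the printed summary, a non-null count lower than the row count points to columns that need cleanup.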

[1] Source: W3Schools - Pandas Introduction

The pandas-profiling module works in conjunction with the Pandas library and creates a detailed profile report of your dataset. The following information is generated when using the df.profile_report() function:

The data type of each column
Any unique, missing, or frequently occurring values
Any duplicate rows of data
Quantile statistics
Mean, median, mode, standard deviation, and coefficient of variation
Histograms
Highlighted correlations between variables within the dataset
A textual review of the dataset [1]

Here is a sample dataset and a corresponding generated report from that dataset:

[Image: sample data set]
[Image: generated report]

    pip install pandas-profiling

To use the pandas-profiling module, use the following command in your scripts:

Here is an example of how to generate a report:
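The original snippets are not preserved here, so the following is only a sketch of one common way to import the module and produce a report. The sample data, the function name write_profile_report, the report title, and the output file name are all assumptions made for illustration. Note that newer releases of the package are published under the name ydata-profiling, so the import line may need adjusting for your installation.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "product": ["apple", "pear", "apple", "plum"],
    "price": [1.20, 0.80, 1.20, 2.50],
})


def write_profile_report(frame, path="report.html"):
    """Build a profile report for `frame` and save it as an HTML file."""
    # Imported inside the function so the rest of the script still runs
    # when the package is not installed.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(frame, title="Sample profile report")
    profile.to_file(path)
    return path


# Uncomment once pandas-profiling is installed:
# write_profile_report(df)
```

Importing pandas_profiling also makes df.profile_report() available on DataFrames, which is the form mentioned above.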

[1] Source: Pandas-Profiling: Introduction

Creating a DataFrame
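One common way to create a DataFrame, sketched here with made-up names and values, is to pass a dictionary whose keys become the column names:

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
data = {
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, 41, 29],
}
df = pd.DataFrame(data)

print(df)
```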

Output: output of above code

Missing data

If you encounter datasets with missing data, the missing cells are represented as NaN (available from the numpy module as np.nan). You can replace every NaN cell with a value of your choice; zero is the most commonly recommended fill value.
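A minimal sketch of filling missing cells with zero, using the fillna() method on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing cells marked as np.nan.
df = pd.DataFrame({
    "score": [90.0, np.nan, 75.0],
    "bonus": [np.nan, 5.0, np.nan],
})

# fillna returns a new DataFrame; the original is left unchanged.
filled = df.fillna(0)

print(filled)
```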

Output:
output of above code

You can also use the interpolate() method to estimate the missing values from the values around them.
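As a sketch with made-up readings, calling interpolate() with its default settings fills each gap linearly between its neighbors:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps.
readings = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 7.0]})

# The default method is linear: each gap becomes the midpoint of its neighbors.
estimated = readings.interpolate()

print(estimated)
```

Here the two gaps become 2.0 and 5.0.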

Another option is to drop any rows that contain NaN values.
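A short sketch of dropping incomplete rows with dropna(), again on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data where one row is missing an age.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, np.nan, 29],
})

# dropna removes every row that contains at least one NaN.
cleaned = df.dropna()

print(cleaned)
```

The surviving rows keep their original index labels, so the index of the result has a gap where the dropped row used to be.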

Output:
output of above code

Iteration

You can also loop through a dataset's rows or columns. Iterating through a DataFrame's columns is fairly straightforward: looping over the DataFrame yields its column labels, and indexing the DataFrame with a label selects that column. To iterate over rows, use .iterrows(), which yields each row's index together with the row's data.

Columns
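A minimal sketch of column iteration on a made-up DataFrame; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, 41, 29],
})

column_values = {}
# Looping over a DataFrame yields its column labels.
for label in df:
    column = df[label]                  # select the column (a Series) by label
    column_values[label] = column.tolist()
    print(label, column_values[label])
```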

Output:
output of above code

Rows
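A minimal sketch of row iteration with .iterrows() on a made-up DataFrame; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, 41, 29],
})

rows_seen = []
# iterrows() yields (index, row) pairs, where row is a Series keyed by column.
for index, row in df.iterrows():
    rows_seen.append((index, row["name"], row["age"]))
    print(index, row["name"], row["age"])
```

For large datasets, row-by-row loops tend to be slow compared with column-wise operations, so this pattern is best kept for small tables or one-off inspection.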

Output:
output of above code

Source: Towards Data Science: A Simple Guide to Pandas DataFrames

Link: Pandas Tutorial
Description: W3Schools free online tutorial and documentation on the Pandas library.

Link: Geeks For Geeks - Python for Data Science
Description: Covers why Python is so widely used in data science, as well as commonly used Python data science libraries such as:

Numpy
Pandas
Matplotlib
Sklearn

Link: Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)
Description: Video tutorial on using Pandas for data science. It covers reading files, how to read data based on conditions, and a number of common operations provided by the library.

Link: DataCamp - Data Manipulation With Pandas
Description: A video series on using the Pandas library, including interactive Python programming quizzes.