101-Pandas¶

101 - Pandas¶

Table of Content¶

Pandas Basics¶

Concepts and Definitions¶

DataFrame¶

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure.

Key features of a DataFrame:

Feature	Description
Tabular Data Structure	It represents data in a table format with rows and columns, much like a spreadsheet or SQL table.
Two-Dimensional	Data is arranged in rows and columns. Each row typically represents an observation or an individual data point, while each column represents a different variable or feature.
Size-Mutable	The DataFrame object’s size can be modified after creation, meaning rows and columns can be added, removed, or modified.
Heterogeneous Data Types	Unlike arrays or lists in Python, a DataFrame can contain different data types (e.g., integers, floats, strings, etc.) within the same DataFrame.
Labels and Indexing	Each row and column in a DataFrame has a label or index associated with it, allowing for easy data retrieval, manipulation, and alignment.

The pandas DataFrame is a fundamental data structure used extensively for data cleaning, data analysis, manipulation, transformation, and computation in various fields, including data science, machine learning, finance, and more.

Creating a DataFrame usually involves importing the pandas library and constructing it from various data sources, such as lists, dictionaries, NumPy arrays, or reading data from CSV files, databases, Excel files, etc. Pandas provides a wide range of functionalities to handle, analyze, and process data stored in DataFrames efficiently.

Setting Display Options¶

When exploring the dataframe, it will be displayed like a table with rows and columns. By default pandas displays a default number of rows=15 and columns=15, so some times our data are much bigger, so the inbetween rows and columns will be trimmed from the display.

In case you need to fully visualize a specific number of rows and columns, we can do that by setting the options for that!:

# we can set numbers for how many rows and columns will be displayed
pd.set_option('display.min_rows', 10) #default will be 10 
pd.set_option('display.max_columns', 20)

Refer to pandas documents to know more about options and settings

Methods to Create a DataFrame¶

In pandas there are various methods to create a dataframe depending on the shape of the data you have.

from Lists or Arrays¶

# Creating a DataFrame from lists
data = {'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']}
df = pd.DataFrame(data)

# Output:
#   Column1	Column2
# 0	1	A
# 1	2	B
# 2	3	C

from List of Dictionaries¶

# Creating a DataFrame from a list of dictionaries
data = [{'Column1': 1, 'Column2': 'A'},
        {'Column1': 2, 'Column2': 'B'},
        {'Column1': 3, 'Column2': 'C'}]
df = pd.DataFrame(data)

# Output:
#   Column1	Column2
# 0	1	A
# 1	2	B
# 2	3	C

from a Numpy Array¶

import numpy as np

# Creating a DataFrame from a NumPy array
data = np.array([[1, 'A'], [2, 'B'], [3, 'C']])
df = pd.DataFrame(data, columns=['Column1', 'Column2'])

# Output:
#   Column1	Column2
# 0	1	A
# 1	2	B
# 2	3	C

from a Series or Dictionary of Series¶

# Creating a DataFrame from pandas Series
series1 = pd.Series([1, 2, 3])
series2 = pd.Series(['A', 'B', 'C'])
df = pd.DataFrame({'Column1': series1, 'Column2': series2})

# Output:
#   Column1	Column2
# 0	1	A
# 1	2	B
# 2	3	C

from a List of Tuples¶

# Creating a DataFrame from a list of tuples
data = [(1, 'A'), (2, 'B'), (3, 'C')]
df = pd.DataFrame(data, columns=['Column1', 'Column2'])

# Output:
#   Column1	Column2
# 0	1	A
# 1	2	B
# 2	3	C

Reading a CSV file¶

As a Data Scientist, most of our works depends on data, these data are mostly stored as a CSV “Comma Seperated Values” files. The data are called a tabular data as they are have a tabular structure represents a data in a table format with raw and columns. In most common practice, we usually call the data in CSV a DataFrame.

To Read Data from CSV File¶

# Importing pandas libarary
import pandas as pd

# Load the data into a varaible called df "dataframe"
df = pd.read_csv('data.csv')

To Read CSV File from URL¶

url = "https://Some.url-to/file.csv"
df_url = pd.read_csv(url)

pd.read_csv Method¶

According to the documentation, there are multiple provided parameters regarding the reading of the CSV file which pandas support. Let’s talk about the most common used ones.

Paramter	Details
File Path or URL	Provide the file path or URL correctly as the first argument. This can be a local file path or a URL pointing to the CSV file.
Header	Specify whether the CSV file contains a header row. Use the header parameter (default is infer) to indicate whether the first row should be used as column names (header=0) or not (header=None).
Index Column	Define which column(s) to use as the DataFrame index. Use the index_col parameter to specify the column number or name to set as the index. For multiple indices, pass a list of column names/numbers.
Specify Columns	Use the usecols parameter to select specific columns from the CSV file. This can be a list of column names or indices to include in the DataFrame.
Data Types	Use the dtype parameter to specify the data types of columns explicitly. This can be a dictionary where keys are column names and values are the desired data types.
Parsing Dates	If the CSV contains date/time data, use the parse_dates parameter to specify which columns to parse as dates. Pass either a list of column names or indices.
Missing Values Handling	Use the na_values parameter to define the values considered as missing (NaN) during parsing. It can be a string, list, or dictionary of columns and their respective missing values.
Skip Rows	To skip initial rows in the CSV file (e.g., for skipping headers or metadata), use the skiprows parameter.
Handling Quotes	If the CSV file uses quotes to enclose fields, use the quotechar parameter to specify the quote character.
Chunking and Memory Optimization	For large datasets, consider using chunksize to read the file in smaller chunks, allowing for efficient memory usage when processing large files.
Encoding	Use the encoding parameter to specify the character encoding of the CSV file, especially if it’s not in UTF-8 format (encoding=’utf-8’).
Custom Delimiter	Use the delimiter or sep parameter to specify a custom delimiter if the CSV file is not comma-separated (e.g., tab-separated: delimiter=’\t’).

import pandas as pd

# Reading CSV file with specific options
data = pd.read_csv('file.csv', 
                   header=0,
                   index_col='ID',
                   usecols=['ID', 'Name', 'Age'],
                   dtype={'ID': int, 'Age': float},
                   parse_dates=['Date'],
                   na_values={'Age': -1},
                   encoding='utf-8')

Assign Operations while reading a CSV File¶

In some cases, during the reading process, we need to perform an operation on a specific column in our data, like for example; let’s assume we have a datetime column (which values represent dates/timestamps), these values are by default stored as an object (Strings) datatype, so we need to convert them to the correct datatype again.

# Using pandas method .assign, to apply a certain function to each raw of a column.
# It takes a lambda expression to apply a certain function on the target column.
df = pd.read_csv('data.csv').assign(date_col = lambda x: pd.to_datetime(x.date_col))

Let’s assume another case, where we want to rename some columns to a different naming:

# Using pandas method .rename, to change the name of columns by giving a dictionary to where the keys are the original column name, and the values are the target new names.
df = pd.read_csv('data.csv').rename(columns={'data_col':'Date'})

Now, let’s warp these two methods together into a single line of command:

# Using () enclosure to be able to split the code line into several lines
df = (
    pd.read_csv("data.csv")
    .assign(
        data_col = lambda x: pd.to_datetime(x.data_col)
    )
    .rename(
        columns={'date_col':'Date'}
    )
)

A note here, in the .assign method, you can simply create a new column by setting is name as you like instead of data_col, like: my_data = lambda x: pd.to_datetime(x.data_col), so this will create a new column called my_data, and the old column data_col still exists in our data. But it has no benefit to have a two columns, and one of them (the Object datatype one) will not be used in this string format.

Saving DataFrame as CSV File¶

After working on your project, some times we may clean, process, or even feature engineer our data, so we have to be able to store these new modified/created dataframe in a CSV file, so we could simply load that data to continue work on it.

# Save dataframe as csv format
df.to_csv("/Path/to/save/df.csv", index=False)

# Adding parameter: as_index and assign a False value to it, so this will not create an index for these dataframe while storing.

Data Exploring: Retrieving rows from a dataframe¶

After we load the dataframe, we need to take a look to check our data, even this look is a glance look, but it is a catchy one, which helps us to make sure that everything is loaded correctly so far.

To do this, we simply could use .head/.tail method to display the first/last n number of rows

# display first 3 rows
df.head(3)

# int_column	float_column	str_column	bool_column	date_column	category_column	object_column	nullable_column
# 0	1	1.5	a	True	2023-01-01	x	<object object at 0x7f12538191d0>	1.5
# 1	2	2.5	b	False	2023-01-02	y	<object object at 0x7f1253819230>	NaN

# display the last 6 rows
df.tail(6)

Data Exploring: DataFrame Schema¶

To know the schema about the dataframe, we could use .info method:

df.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 8 columns):
#  #   Column           Non-Null Count  Dtype         
# ---  ------           --------------  -----         
#  0   int_column       4 non-null      int64         
#  1   float_column     4 non-null      float64       
#  2   str_column       4 non-null      object        
#  3   bool_column      4 non-null      bool          
#  4   date_column      4 non-null      datetime64[ns]
#  5   category_column  4 non-null      category      
#  6   object_column    4 non-null      object        
#  7   nullable_column  3 non-null      float64       
# dtypes: bool(1), category(1), datetime64[ns](1), float64(2), int64(1), object(2)
# memory usage: 464.0+ bytes

Reference¶

Pandas Docs - Cummunity Tutorials: link
Towardsai-Pandas Complete Tutprial for Data Science in 2022: link
kdnuggets - 5 Advanced Features Pandas: link
bitbucket - learn pandas: link
Github - pandas cookbook: link