Using Libraries: NumPy, Pandas & Matplotlib

One of Python's greatest strengths is its ecosystem of libraries. This chapter covers three that are essential for data science, analysis, and visualization: NumPy, Pandas, and Matplotlib.

Why This Chapter Matters

These three libraries are the foundation of Python's data science stack. Understanding them opens doors to machine learning, data analysis, scientific computing, and business intelligence.

NumPy — Numerical Computing

NumPy (Numerical Python) provides a fast, multi-dimensional array object called ndarray and hundreds of mathematical functions.

Installing NumPy

pip install numpy

Creating Arrays

import numpy as np

# From a list
arr = np.array([1, 2, 3, 4, 5])
print(arr)           # [1 2 3 4 5]
print(arr.dtype)     # int64 on most platforms (int32 on Windows with NumPy 1.x)
print(arr.shape)     # (5,)

# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)  # (2, 3)

# Convenience constructors
zeros = np.zeros((3, 4))          # 3x4 array of zeros
ones = np.ones((2, 3))            # 2x3 array of ones
identity = np.eye(3)              # 3x3 identity matrix
range_arr = np.arange(0, 10, 2)  # array([0, 2, 4, 6, 8])
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced points from 0 to 1, both ends included
random_arr = np.random.rand(3, 3) # 3x3 random floats
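
Two details of these constructors are easy to miss: arange stops before its end value while linspace includes it, and random arrays differ on every run unless a seed is fixed. The sketch below illustrates both; the seed value 42 and the variable names are arbitrary choices for this example, not something the chapter prescribes.

```python
import numpy as np

# arange excludes the stop value; linspace includes it by default
steps = np.arange(0, 1, 0.25)
points = np.linspace(0, 1, 5)
print(steps.tolist())    # [0.0, 0.25, 0.5, 0.75]
print(points.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]

# A seeded generator produces the same "random" numbers on every run
rng_a = np.random.default_rng(42)   # 42 is an arbitrary seed
rng_b = np.random.default_rng(42)
sample_a = rng_a.random((3, 3))     # 3x3 floats in [0, 1)
sample_b = rng_b.random((3, 3))
print(np.array_equal(sample_a, sample_b))   # True
```

Fixing the seed is mainly useful for tests and tutorials, where repeatable output matters more than fresh randomness.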

Array Operations (Vectorized)

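NumPy applies arithmetic and comparisons element-wise across whole arrays, so no explicit Python loop is needed. The loop runs in compiled code inside NumPy, which is usually much faster than iterating over a Python list.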

arr = np.array([1, 2, 3, 4, 5])

print(arr * 2)       # [ 2  4  6  8 10]
print(arr + 10)      # [11 12 13 14 15]
print(arr ** 2)      # [ 1  4  9 16 25]
print(arr > 3)       # [False False False  True  True]

# Element-wise operations between arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b)   # [5 7 9]
print(a * b)   # [ 4 10 18]
print(np.dot(a, b))   # 32 (dot product)
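
To make the "no loop" point concrete, here is a minimal sketch that computes the same squares twice, once with an explicit Python loop and once with a single vectorized expression. The variable names are illustrative only.

```python
import numpy as np

values = np.arange(1, 6)          # array([1, 2, 3, 4, 5])

# Pure-Python approach: visit each element explicitly
squares_loop = [v ** 2 for v in values.tolist()]

# Vectorized approach: one expression, the loop runs inside NumPy
squares_vec = values ** 2

print(squares_loop)               # [1, 4, 9, 16, 25]
print(squares_vec.tolist())       # [1, 4, 9, 16, 25]
```

Both approaches give the same numbers; the vectorized form is shorter and tends to pull ahead in speed as the array grows.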

Indexing and Slicing

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])      # 10
print(arr[1:4])    # [20 30 40]
print(arr[-1])     # 50

# Boolean indexing
print(arr[arr > 25])   # [30 40 50]

# 2D indexing
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[0, :])    # first row: [1 2 3]
print(matrix[:, 1])    # second column: [2 5 8]
print(matrix[1, 2])    # row 1, col 2: 6
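
One consequence of slicing worth seeing in code: a basic slice is a view that shares memory with the original array, while boolean indexing returns an independent copy. The following sketch, with made-up variable names, shows the difference.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

window = arr[1:4]        # basic slice: a view onto arr's data
window[0] = 99           # writes through to arr
print(arr)               # [10 99 30 40 50]

picked = arr[arr > 25]   # boolean indexing: an independent copy
picked[0] = -1           # arr is unaffected
print(arr)               # [10 99 30 40 50]

safe = arr[1:4].copy()   # explicit copy when a view is not wanted
safe[0] = 0
print(arr)               # [10 99 30 40 50]
```

Calling .copy() on a slice is the simplest way to avoid changing the original array by accident.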

Useful Math Functions

arr = np.array([4, 9, 16, 25])
print(np.sqrt(arr))   # [2. 3. 4. 5.]
print(np.mean(arr))   # 13.5
print(np.std(arr))    # about 7.89 (population standard deviation)
print(np.sum(arr))    # 54
print(np.min(arr))    # 4
print(np.max(arr))    # 25
print(np.sort(arr))   # sorts a copy
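
Two options on these functions come up often in practice: ddof, which switches np.std between the population and sample formulas, and axis, which applies a reduction per column or per row of a 2D array. A small sketch, reusing the values from above:

```python
import numpy as np

arr = np.array([4, 9, 16, 25])

pop_std = np.std(arr)             # ddof=0 (default): divide by N
sample_std = np.std(arr, ddof=1)  # ddof=1: divide by N - 1
print(round(pop_std, 4))          # 7.8899
print(round(sample_std, 4))       # 9.1104

matrix = np.array([[1, 2, 3], [4, 5, 6]])
col_sums = matrix.sum(axis=0)     # collapse rows, one value per column
row_sums = matrix.sum(axis=1)     # collapse columns, one value per row
print(col_sums.tolist())          # [5, 7, 9]
print(row_sums.tolist())          # [6, 15]
```

Note that Pandas uses ddof=1 by default in its own std(), so the two libraries can report different values for the same data.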

Pandas — Data Analysis

Pandas introduces two powerful data structures: Series (1D) and DataFrame (2D table).

Installing Pandas

pip install pandas

Series

A Series is a labeled 1D array.

import pandas as pd

scores = pd.Series([95, 88, 72, 91], index=["Asha", "Leo", "Mina", "Sam"])
print(scores)
print(scores["Asha"])   # 95
print(scores[scores > 85])   # filter
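
The labels are more than decoration: arithmetic between two Series matches values by label, not by position. The sketch below adds a hypothetical bonus Series (invented for this example) to the scores above.

```python
import pandas as pd

scores = pd.Series([95, 88, 72, 91], index=["Asha", "Leo", "Mina", "Sam"])
bonus = pd.Series([5, 3], index=["Mina", "Asha"])   # hypothetical bonus points

# Plain + aligns on labels; labels missing from either side become NaN
raw_total = scores + bonus
print(int(raw_total.isna().sum()))   # 2 (Leo and Sam have no bonus)

# add() with fill_value treats a missing entry as 0 instead
total = scores.add(bonus, fill_value=0)
print(total.to_dict())
# {'Asha': 98.0, 'Leo': 88.0, 'Mina': 77.0, 'Sam': 91.0}
```

The result is a float Series, because the intermediate missing values force a conversion away from integers.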

DataFrame

A DataFrame is a 2D table — like a spreadsheet.

data = {
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"]
}

df = pd.DataFrame(data)
print(df)
print(df.shape)         # (4, 3)
print(df.dtypes)        # column types
print(df.describe())    # stats summary
print(df.head(2))       # first 2 rows
print(df.tail(2))       # last 2 rows

Selecting Data

# Select a column
print(df["Name"])
print(df[["Name", "Score"]])   # multiple columns

# Row selection
print(df.iloc[0])        # by integer position
print(df.loc[0])         # by index label (same row here, because the default labels are 0, 1, 2, ...)

# Conditional filtering
top = df[df["Score"] >= 90]
print(top)
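
loc and iloc only look identical while the index is the default 0, 1, 2, and so on. After filtering, rows keep their original labels, so the two diverge. A minimal sketch that rebuilds a small DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
})

top = df[df["Score"] >= 90]       # keeps original labels 0 and 3
print(top.index.tolist())         # [0, 3]
print(top.iloc[1]["Name"])        # Sam, second row by position
print(top.loc[3, "Name"])         # Sam, row whose label is 3

# Combine conditions with & (and) or | (or); each condition needs parentheses
mid = df[(df["Score"] >= 80) & (df["Score"] < 95)]
print(mid["Name"].tolist())       # ['Leo', 'Sam']
```

Here top.loc[1] would raise a KeyError, because no row with label 1 survived the filter.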

Adding and Modifying Columns

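df["Passed"] = df["Score"] >= 60          # boolean column from a comparison
df["Score_Boosted"] = df["Score"] + 5     # new column derived from an existing one
df = df.drop(columns=["Score_Boosted"])   # drop() returns a new DataFrame
renamed = df.rename(columns={"Score": "Final Score"})   # df itself keeps "Score"

Handling Missing Data

import numpy as np

df["Score"] = df["Score"].astype(float)   # NaN is a float, so convert the integer column first
df.loc[2, "Score"] = np.nan     # set a missing value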
print(df.isnull())               # boolean mask
print(df.isnull().sum())         # count missing per column
df_clean = df.dropna()          # drop rows with any NaN
df_filled = df.fillna(0)        # fill missing with 0
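
Filling with 0 is not always appropriate, since it drags averages down. A common alternative is to fill with the column mean. The sketch below assumes a small DataFrame with one missing score, built inline so the example is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95.0, 88.0, np.nan, 91.0],
})

missing = int(df["Score"].isna().sum())
print(missing)                            # 1

mean_score = df["Score"].mean()           # NaN values are skipped by default
print(round(mean_score, 2))               # 91.33

# assign() returns a new DataFrame and leaves df untouched
filled = df.assign(Score=df["Score"].fillna(mean_score))
print(int(filled["Score"].isna().sum()))  # 0
```

Whether to drop, zero-fill, or mean-fill depends on the data; none of these is correct in every case.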

Reading and Writing Files

# CSV
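df = pd.read_csv("students.csv")       # the file must exist in the working directory
df.to_csv("output.csv", index=False)   # index=False leaves out the row labels

# Excel
df = pd.read_excel("data.xlsx")        # needs an Excel engine such as openpyxl installed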

# JSON
df = pd.read_json("data.json")
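
The file names above are placeholders, so those lines fail unless the files exist. To try the CSV round trip without touching the disk, one option is an in-memory text buffer, as in this sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Leo"], "Score": [95, 88]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines)                 # ['Name,Score', 'Asha,95', 'Leo,88']

# Rewind the buffer and read the same text back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(df))   # True
```

Without index=False, to_csv would also write the row labels as an extra unnamed column, and the round trip would no longer match.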

Grouping and Aggregation

# Group by Grade and compute mean score (df needs "Grade" and "Score" columns)
summary = df.groupby("Grade")["Score"].mean()
print(summary)

# Multiple aggregations
summary2 = df.groupby("Grade").agg({"Score": ["mean", "max", "count"]})
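
The dictionary form of agg produces two-level column names. Named aggregation is an alternative that yields flat, readable columns. A self-contained sketch using the student data from earlier (the output column names are invented for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby("Grade").agg(
    mean_score=("Score", "mean"),
    best_score=("Score", "max"),
    students=("Name", "count"),
)

print(summary.columns.tolist())          # ['mean_score', 'best_score', 'students']
print(summary.loc["A", "mean_score"])    # 93.0
print(summary.loc["A", "students"])      # 2
```

The group keys (here the grades) become the index of the result, which is why rows are looked up with loc["A"].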

Sorting

df_sorted = df.sort_values("Score", ascending=False)
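
Sorting can use several columns at once, and nlargest is a shortcut when only the top few rows are needed. A brief sketch with the same student data, rebuilt here so it runs independently:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"],
})

# Sort by Grade ascending, then by Score descending within each grade
ordered = df.sort_values(["Grade", "Score"], ascending=[True, False])
print(ordered["Name"].tolist())        # ['Asha', 'Sam', 'Leo', 'Mina']

# Top-N rows by one column
top2 = df.nlargest(2, "Score")
print(top2["Name"].tolist())           # ['Asha', 'Sam']

# sort_values returns a new DataFrame; the original order is untouched
print(df["Name"].tolist())             # ['Asha', 'Leo', 'Mina', 'Sam']
```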

Matplotlib — Data Visualization

Matplotlib is the foundational plotting library for Python.

Installing Matplotlib

pip install matplotlib

Line Plot

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 30, 25]

plt.plot(x, y, marker="o", color="blue", linestyle="--")
plt.title("My Line Chart")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.savefig("chart.png")   # save before show(); the figure may be empty after the window closes
plt.show()

Bar Chart

names = ["Asha", "Leo", "Mina"]
scores = [95, 88, 72]

plt.bar(names, scores, color=["green", "orange", "red"])
plt.title("Student Scores")
plt.ylabel("Score")
plt.show()

Scatter Plot

import numpy as np

x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y, alpha=0.7, color="purple")
plt.title("Scatter Plot")
plt.show()

Histogram

data = np.random.randn(1000)   # 1000 samples from a standard normal distribution
plt.hist(data, bins=30, color="teal", edgecolor="black")
plt.title("Distribution")
plt.show()

Subplots

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([1, 2, 3], [10, 20, 15])
axes[0].set_title("Line")

axes[1].bar(["A", "B", "C"], [5, 10, 8])
axes[1].set_title("Bar")

plt.tight_layout()
plt.show()
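
The axes objects returned by plt.subplots can also drive a single plot, which keeps every call tied to an explicit figure. The sketch below is one way to do that; it selects the non-interactive "Agg" backend and saves into an in-memory buffer so that it runs without opening a window, both of which are choices made for this example only.

```python
import io
import matplotlib
matplotlib.use("Agg")             # non-interactive backend, no window needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([1, 2, 3, 4, 5], [10, 20, 15, 30, 25], marker="o", label="series A")
ax.set_title("My Line Chart")     # plt.title(...) becomes ax.set_title(...)
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.legend()

png = io.BytesIO()                # stands in for a file on disk
fig.savefig(png, format="png", dpi=100)
plt.close(fig)                    # release the figure's memory

print(png.getvalue()[:4])         # b'\x89PNG'
```

In a normal script, fig.savefig("chart.png") followed by plt.show() does the same job with a real file and a window.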

Putting It Together — A Mini Analysis

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("sales.csv")

# Clean
df = df.dropna(subset=["revenue"])

# Analyze
monthly = df.groupby("month")["revenue"].sum()

# Visualize
monthly.plot(kind="bar", color="steelblue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.savefig("revenue.png")
plt.show()
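
Since sales.csv is not provided, the same pipeline can be tried on inline data. The sketch below uses a few made-up rows (the months and revenue figures are invented purely for illustration), and it uses the "Agg" backend plus an in-memory buffer so it runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")             # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data standing in for sales.csv
csv_text = """month,revenue
Jan,100
Jan,50
Feb,
Feb,200
Mar,75
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.dropna(subset=["revenue"])        # drops the Feb row with no revenue

# sort=False keeps months in order of first appearance;
# the default would sort the group keys alphabetically (Feb, Jan, Mar)
monthly = df.groupby("month", sort=False)["revenue"].sum()
print(monthly.to_dict())   # {'Jan': 150.0, 'Feb': 200.0, 'Mar': 75.0}

ax = monthly.plot(kind="bar", color="steelblue")
ax.set_title("Monthly Revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($)")
ax.figure.tight_layout()

out = io.BytesIO()
ax.figure.savefig(out, format="png")
plt.close(ax.figure)
```

With real data, a month column stored as text will usually need an explicit ordering, otherwise the bars appear in alphabetical rather than calendar order.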

Common Mistakes

  • forgetting the imports (import numpy as np, import pandas as pd, import matplotlib.pyplot as plt)
  • modifying a filtered or sliced DataFrame without knowing whether it is a copy or a view (call .copy() when you need an independent DataFrame)
  • looping over DataFrame rows with a for loop when a vectorized column operation would do
  • forgetting plt.show() or plt.savefig(), or calling plt.savefig() after plt.show(), which can save a blank image
  • ignoring SettingWithCopyWarning from Pandas
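
Two of the mistakes above are easiest to understand in code. The following sketch shows an explicit .copy() before modifying a filtered DataFrame, and a vectorized column operation next to its slower loop equivalent. The "Honors" column is a made-up name for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
})

# Explicit copy: safe to modify, and the original is not affected
top = df[df["Score"] >= 90].copy()
top["Honors"] = True
print("Honors" in df.columns)        # False

# Vectorized: one comparison over the whole column
df["Passed"] = df["Score"] >= 60

# Loop equivalent: same result, but slower on large DataFrames
passed_loop = [bool(row.Score >= 60) for row in df.itertuples()]
print(df["Passed"].tolist() == passed_loop)   # True
```

Without the .copy(), assigning a new column to the filtered result is the typical trigger for SettingWithCopyWarning in Pandas versions that emit it.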

Mini Exercises

  1. Create a NumPy array of the integers 1 to 20 and select all values greater than 10.
  2. Create a DataFrame from a dictionary of your choice and filter rows by a condition.
  3. Read a CSV file with Pandas and print the 5 rows with the highest values in one column.
  4. Plot a bar chart comparing at least 4 categories.
  5. Combine NumPy and Matplotlib to plot a sine wave.

Review Questions

  1. What is the key advantage of NumPy arrays over Python lists for math?
  2. What is the difference between iloc and loc in Pandas?
  3. How do you handle missing values in a Pandas DataFrame?
  4. What is groupby() used for?
  5. How do you save a Matplotlib figure to a file?

Reference Checklist

  • I can create and manipulate NumPy arrays
  • I can create DataFrames from dicts and CSVs
  • I can filter, sort, and group Pandas DataFrames
  • I can handle missing data with dropna() and fillna()
  • I can create line, bar, scatter, and histogram plots
  • I can save plots and build multi-panel figures with subplots