Using Libraries: NumPy, Pandas & Matplotlib
One of Python's greatest strengths is its ecosystem of libraries. This chapter covers three essential libraries used in data science, analysis, and visualization: NumPy, Pandas, and Matplotlib.
Why This Chapter Matters
These three libraries are the foundation of Python's data science stack. Understanding them opens doors to machine learning, data analysis, scientific computing, and business intelligence.
NumPy — Numerical Computing
NumPy (Numerical Python) provides a fast, multi-dimensional array object called ndarray and hundreds of mathematical functions.
Installing NumPy
pip install numpy
Creating Arrays
import numpy as np
# From a list
arr = np.array([1, 2, 3, 4, 5])
print(arr) # [1 2 3 4 5]
print(arr.dtype) # int64 on most platforms (int32 on Windows with NumPy 1.x)
print(arr.shape) # (5,)
# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape) # (2, 3)
# Convenience constructors
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
identity = np.eye(3) # 3x3 identity matrix
range_arr = np.arange(0, 10, 2) # array([0, 2, 4, 6, 8])
linspace = np.linspace(0, 1, 5) # 5 evenly spaced points 0 to 1
random_arr = np.random.rand(3, 3) # 3x3 random floats
Array Operations (Vectorized)
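NumPy operations apply element-wise to whole arrays without explicit Python loops. The looping happens in compiled code, which is typically much faster than iterating over a Python list.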
arr = np.array([1, 2, 3, 4, 5])
print(arr * 2) # [2 4 6 8 10]
print(arr + 10) # [11 12 13 14 15]
print(arr ** 2) # [1 4 9 16 25]
print(arr > 3) # [False False False True True]
# Element-wise operations between arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5 7 9]
print(a * b) # [4 10 18]
print(np.dot(a, b)) # 32 (dot product)
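To make the speed claim concrete, here is a minimal timing sketch. The function names `square_with_loop` and `square_vectorized`, the array size, and the repeat count are illustrative choices, not part of the chapter. The exact timings depend on your machine, so none are shown.

```python
import timeit

import numpy as np

values = list(range(100_000))
# dtype is set explicitly so the squares fit on every platform
arr = np.arange(100_000, dtype=np.int64)

def square_with_loop():
    # Pure-Python approach: one interpreter-level step per element
    return [v * v for v in values]

def square_vectorized():
    # NumPy approach: one expression, with the loop run in compiled code
    return arr * arr

# Both approaches produce the same numbers
assert square_with_loop() == square_vectorized().tolist()

loop_time = timeit.timeit(square_with_loop, number=20)
vec_time = timeit.timeit(square_vectorized, number=20)
print(f"list comprehension: {loop_time:.4f} s")
print(f"vectorized NumPy:   {vec_time:.4f} s")
```

On typical hardware the vectorized version should come out well ahead, but treat the numbers as a rough comparison rather than a benchmark.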
Indexing and Slicing
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # 10
print(arr[1:4]) # [20 30 40]
print(arr[-1]) # 50
# Boolean indexing
print(arr[arr > 25]) # [30 40 50]
# 2D indexing
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[0, :]) # first row: [1 2 3]
print(matrix[:, 1]) # second column: [2 5 8]
print(matrix[1, 2]) # row 1, col 2: 6
Useful Math Functions
arr = np.array([4, 9, 16, 25])
print(np.sqrt(arr)) # [2. 3. 4. 5.]
print(np.mean(arr)) # 13.5
print(np.std(arr)) # population standard deviation (ddof=0), about 7.89
print(np.sum(arr)) # 54
print(np.min(arr)) # 4
print(np.max(arr)) # 25
print(np.sort(arr)) # returns a sorted copy; arr itself is unchanged
Pandas — Data Analysis
Pandas introduces two powerful data structures: Series (1D) and DataFrame (2D table).
Installing Pandas
pip install pandas
Series
A Series is a labeled 1D array.
import pandas as pd
scores = pd.Series([95, 88, 72, 91], index=["Asha", "Leo", "Mina", "Sam"])
print(scores)
print(scores["Asha"]) # 95
print(scores[scores > 85]) # filter
DataFrame
A DataFrame is a 2D table with labeled rows and columns, much like a spreadsheet.
data = {
"Name": ["Asha", "Leo", "Mina", "Sam"],
"Score": [95, 88, 72, 91],
"Grade": ["A", "B", "C", "A"]
}
df = pd.DataFrame(data)
print(df)
print(df.shape) # (4, 3)
print(df.dtypes) # column types
print(df.describe()) # stats summary
print(df.head(2)) # first 2 rows
print(df.tail(2)) # last 2 rows
Selecting Data
# Select a column
print(df["Name"])
print(df[["Name", "Score"]]) # multiple columns
# Row selection
print(df.iloc[0]) # by integer position
print(df.loc[0]) # by index label (same row here, because the default labels are 0, 1, 2, ...)
# Conditional filtering
top = df[df["Score"] >= 90]
print(top)
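With the default index, iloc and loc return the same row, which can hide the difference between them. The sketch below uses a hypothetical `students` frame (the same names and scores as above) and sorts it so that positions and labels no longer line up.

```python
import pandas as pd

students = pd.DataFrame(
    {"Name": ["Asha", "Leo", "Mina", "Sam"], "Score": [95, 88, 72, 91]}
)

# Sorting reorders the rows but keeps the original index labels,
# so the labels now read 0, 3, 1, 2 from top to bottom
ranked = students.sort_values("Score", ascending=False)

by_position = ranked.iloc[1]  # second row in the current order
by_label = ranked.loc[1]      # the row whose index label is 1

print(by_position["Name"])  # Sam
print(by_label["Name"])     # Leo

# Slices differ too: iloc excludes the end position, loc includes the end label
print(len(students.iloc[0:2]))  # 2
print(len(students.loc[0:2]))   # 3
```

A reasonable rule of thumb: use iloc when you mean "the n-th row" and loc when you mean "the row with this label".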
Adding and Modifying Columns
df["Passed"] = df["Score"] >= 60
df["Score_Boosted"] = df["Score"] + 5
df = df.drop(columns=["Score_Boosted"])
renamed = df.rename(columns={"Score": "Final Score"}) # returns a new DataFrame; df keeps the "Score" column used below
Handling Missing Data
import numpy as np
df.loc[2, "Score"] = np.nan # set a missing value
print(df.isnull()) # boolean mask
print(df.isnull().sum()) # count missing per column
df_clean = df.dropna() # drop rows with any NaN
df_filled = df.fillna(0) # fill missing with 0
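Which fill value to use depends on what the column means. As a rough illustration, the sketch below compares filling with 0 against filling with the column mean. The `scores` frame is a hypothetical stand-in for the chapter's data, and mean-filling is only one possible strategy, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"Name": ["Asha", "Leo", "Mina", "Sam"], "Score": [95.0, 88.0, np.nan, 91.0]}
)

print(scores["Score"].isnull().sum())  # 1

# fillna returns a new DataFrame; a dict limits the fill to named columns
filled_zero = scores.fillna({"Score": 0})
# Series.mean() skips NaN by default, so this is the mean of the 3 known scores
filled_mean = scores.fillna({"Score": scores["Score"].mean()})

print(filled_zero["Score"].mean())            # 68.5
print(round(filled_mean["Score"].mean(), 2))  # 91.33
```

Filling with 0 drags the average down from about 91.33 to 68.5, because the missing score is then treated as a real score of zero.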
Reading and Writing Files
# CSV
df = pd.read_csv("students.csv")
df.to_csv("output.csv", index=False)
# Excel
df = pd.read_excel("data.xlsx") # needs an Excel engine installed, e.g. pip install openpyxl
# JSON
df = pd.read_json("data.json")
Grouping and Aggregation
# Group by Grade and compute mean score
summary = df.groupby("Grade")["Score"].mean()
print(summary)
# Multiple aggregations
summary2 = df.groupby("Grade").agg({"Score": ["mean", "max", "count"]})
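The grouping calls above depend on whatever file was loaded last. Here is a self-contained sketch on a hypothetical `students` frame, showing what each form returns. The output column names `mean_score` and `top_score` are made up for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"],
})

# One aggregation: the result is a Series indexed by Grade
mean_by_grade = students.groupby("Grade")["Score"].mean()
print(mean_by_grade["A"])  # 93.0

# Several aggregations via a dict: the columns form a two-level index
stats = students.groupby("Grade").agg({"Score": ["mean", "max", "count"]})
print(stats.loc["A", ("Score", "max")])    # 95
print(stats.loc["A", ("Score", "count")])  # 2

# Named aggregation gives flat, readable column names instead
flat = students.groupby("Grade").agg(
    mean_score=("Score", "mean"),
    top_score=("Score", "max"),
)
print(flat.columns.tolist())  # ['mean_score', 'top_score']
```

Named aggregation is often easier to work with afterwards, since the result has ordinary single-level columns.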
Sorting
df_sorted = df.sort_values("Score", ascending=False)
Matplotlib — Data Visualization
Matplotlib is the foundational plotting library for Python.
Installing Matplotlib
pip install matplotlib
Line Plot
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 30, 25]
plt.plot(x, y, marker="o", color="blue", linestyle="--")
plt.title("My Line Chart")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.savefig("chart.png")
plt.show()
Bar Chart
names = ["Asha", "Leo", "Mina"]
scores = [95, 88, 72]
plt.bar(names, scores, color=["green", "orange", "red"])
plt.title("Student Scores")
plt.ylabel("Score")
plt.show()
Scatter Plot
import numpy as np
x = np.random.rand(50)
y = np.random.rand(50)
plt.scatter(x, y, alpha=0.7, color="purple")
plt.title("Scatter Plot")
plt.show()
Histogram
data = np.random.randn(1000)
plt.hist(data, bins=30, color="teal", edgecolor="black")
plt.title("Distribution")
plt.show()
Subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot([1, 2, 3], [10, 20, 15])
axes[0].set_title("Line")
axes[1].bar(["A", "B", "C"], [5, 10, 8])
axes[1].set_title("Bar")
plt.tight_layout()
plt.show()
Putting It Together — A Mini Analysis
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv("sales.csv")
# Clean
df = df.dropna(subset=["revenue"])
# Analyze
monthly = df.groupby("month")["revenue"].sum()
# Visualize
monthly.plot(kind="bar", color="steelblue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.savefig("revenue.png")
plt.show()
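The script above needs a sales.csv file on disk. To try the same load, clean, analyze, visualize flow without one, the sketch below builds a small hypothetical table in memory. The `sales` values, the output filename `revenue_demo.png`, and the Agg backend are assumptions made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the contents of sales.csv
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "revenue": [1200.0, 800.0, np.nan, 1500.0, 950.0],
})

# Clean: drop rows where revenue is missing
sales = sales.dropna(subset=["revenue"])

# Analyze: sort=False keeps months in order of first appearance;
# by default groupby sorts the keys, which would give Feb, Jan, Mar
monthly = sales.groupby("month", sort=False)["revenue"].sum()
print(monthly.to_dict())  # {'Jan': 2000.0, 'Feb': 1500.0, 'Mar': 950.0}

# Visualize: draw on an explicit Axes so the figure can be saved and closed
fig, ax = plt.subplots()
monthly.plot(kind="bar", color="steelblue", ax=ax)
ax.set_title("Monthly Revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($)")
fig.tight_layout()
fig.savefig("revenue_demo.png")
plt.close(fig)
```

For an interactive session, remove the `matplotlib.use("Agg")` line and call `plt.show()` before closing the figure.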
Common Mistakes
- forgetting to import numpy as np / import pandas as pd / import matplotlib.pyplot as plt
- modifying a DataFrame column without understanding copy vs view (use .copy())
- using a for loop over a DataFrame instead of vectorized operations
- not calling plt.show() or plt.savefig() to see/save plots
- ignoring SettingWithCopyWarning from Pandas
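Two of these mistakes are easier to see in code. The sketch below, on a hypothetical `students` frame, takes an explicit copy before modifying a filtered subset, then compares a row-by-row loop with the equivalent vectorized expression. Whether Pandas warns when the copy is skipped depends on the version, so the example shows only the safe pattern.

```python
import pandas as pd

students = pd.DataFrame(
    {"Name": ["Asha", "Leo", "Mina", "Sam"], "Score": [95, 88, 72, 91]}
)

# Copy vs view: copy the filtered rows explicitly before changing them,
# so it is unambiguous that the original frame stays untouched
top = students[students["Score"] >= 90].copy()
top["Score"] = top["Score"] + 5

print(students["Score"].tolist())  # [95, 88, 72, 91]
print(top["Score"].tolist())       # [100, 96]

# Loop version: works, but handles one row at a time in Python
passed_loop = []
for _, row in students.iterrows():
    passed_loop.append(bool(row["Score"] >= 60))

# Vectorized version: one expression over the whole column
passed_vec = students["Score"] >= 60

print(passed_loop == passed_vec.tolist())  # True
```

The two versions give the same answer. The vectorized one is shorter and, on large tables, usually much faster.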
Mini Exercises
- Create a NumPy array of 1–20 and select all values greater than 10.
- Create a DataFrame from a dictionary of your choice and filter rows by a condition.
- Read a CSV file with Pandas and print the 5 rows with the highest values in one column.
- Plot a bar chart comparing at least 4 categories.
- Combine NumPy and Matplotlib to plot a sine wave.
Review Questions
- What is the key advantage of NumPy arrays over Python lists for math?
- What is the difference between iloc and loc in Pandas?
- How do you handle missing values in a Pandas DataFrame?
- What is groupby() used for?
- How do you save a Matplotlib figure to a file?
Reference Checklist
- I can create and manipulate NumPy arrays
- I can create DataFrames from dicts and CSVs
- I can filter, sort, and group Pandas DataFrames
- I can handle missing data with dropna() and fillna()
- I can create line, bar, scatter, and histogram plots
- I can save plots and build multi-panel figures with subplots