Exploring Palmer Penguins
Demonstration with Quarto Live
Introduction
Welcome to this interactive lab where we’ll explore data science concepts using the Palmer Penguins dataset. We’ll work with both R and Python to learn:
- Data exploration and visualization
- Basic statistical analysis
- Data manipulation and transformation
- Creating publication-quality plots
The Palmer Penguins dataset contains measurements from three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
Getting Started with R
Let’s first load our required R packages and examine the data:
Exercise 1: Basic Data Exploration
Using R, find out:
- How many penguins are in the dataset?
- What are the unique species?
- What’s the average flipper length?
Hints:
- Use nrow() to count rows
- Use unique() to find unique values
- Use mean() with na.rm=TRUE to calculate means
list(
"Total penguins:" = nrow(penguins),
"Species:" = unique(penguins$species),
"Mean flipper length:" = round(mean(penguins$flipper_length_mm, na.rm = TRUE), 2)
)
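The na.rm = TRUE argument matters because a single missing value otherwise makes the whole mean missing. Here is a minimal Python sketch of the same idea, using only the standard library. The flipper lengths are made-up illustrative values, not taken from the dataset, and math.nan stands in for a missing measurement:

```python
import math
from statistics import fmean

# Illustrative flipper lengths in mm (made-up values, not from the dataset).
# math.nan stands in for a missing measurement.
flipper_lengths = [181.0, 186.0, math.nan, 195.0, 190.0]

# A mean computed over data that contains NaN is itself NaN.
naive_mean = fmean(flipper_lengths)

# Dropping the missing values first mirrors what na.rm = TRUE does in R.
observed = [x for x in flipper_lengths if not math.isnan(x)]
clean_mean = round(fmean(observed), 2)

print("Mean with missing value kept:", naive_mean)
print("Mean with missing value dropped:", clean_mean)
```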
Visualizing Data in R
Let’s create some plots to understand relationships in our data.
Exercise 2: Create a Plot
Create a box plot showing bill length by species and sex. Use ggplot2 and color the boxes by sex.
ggplot(penguins, aes(x = species, y = bill_length_mm, fill = sex)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Bill Length by Species and Sex")
Python Analysis
Now let’s switch to Python and perform similar analyses:
Exercise 3: Python Data Summary
Create a summary of the numerical variables using pandas:
Use the pandas describe() method to get summary statistics.
penguins.describe()
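To see what describe() reports for a numeric column, the following standard-library sketch computes the same eight statistics by hand. The body masses are made-up illustrative values, and method="inclusive" is chosen on the assumption that it matches the linear interpolation pandas uses for quartiles by default:

```python
from statistics import mean, stdev, quantiles

# Illustrative body masses in grams (made-up values, not from the dataset).
body_mass_g = [3400.0, 3600.0, 3800.0, 4000.0, 4200.0]

# method="inclusive" interpolates linearly between the sorted data points
# (assumed here to mirror the default quartile behaviour of pandas).
q1, median, q3 = quantiles(body_mass_g, n=4, method="inclusive")

summary = {
    "count": len(body_mass_g),
    "mean": mean(body_mass_g),
    "std": stdev(body_mass_g),  # sample standard deviation (n - 1 denominator)
    "min": min(body_mass_g),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(body_mass_g),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```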
Exercise 4: Statistical Analysis
Using Python, test if there’s a significant difference in flipper length between male and female penguins:
from scipy import stats
# Flipper lengths by sex, NaNs dropped (assumes sex is coded 'male'/'female')
males = penguins.loc[penguins['sex'] == 'male', 'flipper_length_mm'].dropna()
females = penguins.loc[penguins['sex'] == 'female', 'flipper_length_mm'].dropna()
# Perform t-test
t_stat, p_val = stats.ttest_ind(males, females)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
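By default, stats.ttest_ind assumes equal variances in the two groups, so the statistic it reports is Student's pooled-variance t. A rough sketch of that calculation with the standard library follows. The helper pooled_t_statistic and the sample values are hypothetical, and the p-value, which requires the t distribution, is left to SciPy:

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two group means.
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error


# Illustrative flipper lengths in mm (made-up values, not from the dataset).
males_demo = [200.0, 204.0, 208.0]
females_demo = [190.0, 194.0, 198.0]

t_demo = pooled_t_statistic(males_demo, females_demo)
print(f"T-statistic: {t_demo:.4f}")
```

Passing equal_var=False to stats.ttest_ind switches to Welch's test, which does not pool the variances.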
Visualization with Seaborn
Let’s create an advanced visualization using seaborn:
Exercise 5: Create a Complex Visualization
Create a violin plot showing the distribution of body mass by species, with a separate violin for each sex:
Use sns.violinplot() with the following parameters:
- data=penguins
- x="species"
- y="body_mass_g"
- hue="sex"
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.violinplot(data=penguins, x="species", y="body_mass_g", hue="sex")
plt.title("Distribution of Body Mass by Species and Sex")
plt.show()
Final Challenge
Exercise 6: Data Analysis Project
Choose either R or Python to complete the following analysis:
- Filter the data to include only complete cases (no missing values)
- Calculate the average measurements for each species-sex combination
- Create a visualization showing these averages
- Add error bars showing standard error
Use the code block below and refer to previous examples for guidance:
library(tidyverse)
penguins |>
  drop_na() |>
  group_by(species, sex) |>
  summarise(
    mean_flipper = mean(flipper_length_mm),
    se_flipper = sd(flipper_length_mm)/sqrt(n())
  ) |>
  ggplot(aes(x=species, y=mean_flipper, fill=sex)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(aes(ymin=mean_flipper-se_flipper,
                    ymax=mean_flipper+se_flipper),
                position=position_dodge(0.9),
                width=0.25) +
  theme_minimal() +
  labs(title="Average Flipper Length by Species and Sex",
       y="Flipper Length (mm)")
import numpy as np
import matplotlib.pyplot as plt
# Calculate statistics
stats_df = penguins.groupby(['species', 'sex'])['flipper_length_mm'].agg(['mean', 'std', 'count']).reset_index()
stats_df['se'] = stats_df['std'] / np.sqrt(stats_df['count'])
# Create plot
plt.figure(figsize=(10, 6))
species_list = stats_df['species'].unique()
x = np.arange(len(species_list))
width = 0.35
plt.bar(x - width/2, stats_df[stats_df['sex']=='male']['mean'],
        width, label='Male', yerr=stats_df[stats_df['sex']=='male']['se'])
plt.bar(x + width/2, stats_df[stats_df['sex']=='female']['mean'],
        width, label='Female', yerr=stats_df[stats_df['sex']=='female']['se'])
plt.xlabel('Species')
plt.ylabel('Flipper Length (mm)')
plt.title('Average Flipper Length by Species and Sex')
plt.xticks(x, species_list)
plt.legend()
plt.show()
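The non-plotting steps of the challenge (complete cases, group means, standard errors) can also be sketched without any data-frame library. The rows below are made-up illustrative values, not taken from the dataset, with None standing in for a missing measurement:

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Illustrative (species, sex, flipper_length_mm) rows; made-up values.
rows = [
    ("Adelie", "male", 190.0), ("Adelie", "male", 194.0), ("Adelie", "male", 198.0),
    ("Adelie", "female", 186.0), ("Adelie", "female", 188.0), ("Adelie", "female", 190.0),
    ("Gentoo", "male", 220.0), ("Gentoo", "male", 222.0), ("Gentoo", "male", 224.0),
    ("Gentoo", "female", 210.0), ("Gentoo", "female", 214.0), ("Gentoo", "female", None),
]

# Step 1: keep only complete cases (rows without a missing value).
complete = [row for row in rows if None not in row]

# Step 2: collect the measurements for each species-sex combination.
groups = defaultdict(list)
for species, sex, flipper in complete:
    groups[(species, sex)].append(flipper)

# Steps 2 and 4: group mean and standard error (sample sd divided by sqrt(n)).
summary = {
    key: {"mean": mean(values), "se": stdev(values) / sqrt(len(values))}
    for key, values in groups.items()
}

for (species, sex), result in summary.items():
    print(f"{species:<7} {sex:<7} mean={result['mean']:.1f} se={result['se']:.3f}")
```

The standard error shrinks as the group size grows, which is why the error bars are computed per group and not from the overall standard deviation.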
Conclusion
In this lab, we’ve:
- Explored the Palmer Penguins dataset using both R and Python
- Created various types of visualizations
- Performed basic statistical analyses
- Learned about data manipulation techniques
For more information about the dataset, visit the Palmer Penguins website.