Exploring Palmer Penguins

Demonstration with Quarto Live

Introduction

Welcome to this interactive lab where we’ll explore data science concepts using the Palmer Penguins dataset. We’ll work with both R and Python to learn:

Data exploration and visualization
Basic statistical analysis
Data manipulation and transformation
Creating publication-quality plots

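The Palmer Penguins dataset contains body measurements (bill length and depth, flipper length, and body mass), along with sex and year of observation, for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.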

Getting Started with R

Let’s first load our required R packages and examine the data:

Exercise 1: Basic Data Exploration

Using R, find out:

How many penguins are in the dataset?
What are the unique species?
What’s the average flipper length?

Visualizing Data in R

Let’s create some plots to understand relationships in our data.

Exercise 2: Create a Plot

Create a box plot showing bill length by species and sex. Use ggplot2 and color the boxes by sex.

ggplot(penguins, aes(x = species, y = bill_length_mm, fill = sex)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Bill Length by Species and Sex")

Python Analysis

Now let’s switch to Python and perform similar analyses:

Exercise 3: Python Data Summary

Create a summary of the numerical variables using pandas:
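One possible starting point is `DataFrame.describe()`, which reports the count, mean, standard deviation, and quartiles of each numeric column. The sketch below runs on a small hand-made DataFrame called `toy_penguins` (a hypothetical name with made-up values, not real measurements) so that it is self-contained; in the lab the same calls would be applied to `penguins`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset; all values are made up.
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "flipper_length_mm": [181.0, 186.0, 217.0, None],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 5400.0],
})

# By default describe() summarizes numeric columns only and skips missing values.
summary = toy_penguins.describe()
print(summary)

# Counting missing values per column is a useful companion to the summary.
print(toy_penguins.isna().sum())
```

The `count` row shows how many non-missing values each column has, which is a quick way to spot incomplete records before running further analyses.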

Exercise 4: Statistical Analysis

Using Python, test if there’s a significant difference in flipper length between male and female penguins:

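from scipy import stats
# Split flipper lengths by sex (assumes `penguins` is loaded and sex is coded "male"/"female")
males = penguins.loc[penguins["sex"] == "male", "flipper_length_mm"].dropna()
females = penguins.loc[penguins["sex"] == "female", "flipper_length_mm"].dropna()
# Perform an independent two-sample t-test
t_stat, p_val = stats.ttest_ind(males, females)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")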
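To make clear what the t-statistic measures, the following sketch computes the pooled-variance (Student's) two-sample statistic by hand using only the standard library. This is the variant `scipy.stats.ttest_ind` uses by default (`equal_var=True`). The function name `pooled_t_statistic` and the sample values are illustrative assumptions, not real penguin measurements.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Return Student's two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # statistics.variance() is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    df = n_a + n_b - 2
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df

# Made-up values for illustration only.
group_a = [200.0, 204.0, 208.0]
group_b = [196.0, 200.0, 204.0]
t_stat, df = pooled_t_statistic(group_a, group_b)
print(f"T-statistic: {t_stat:.4f} with {df} degrees of freedom")
```

The statistic is the difference in group means divided by its standard error, so a larger absolute value means the difference is large relative to the spread of the data. If the two groups have clearly different variances, Welch's test (`equal_var=False` in SciPy) is generally the safer choice.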

Visualization with Seaborn

Let’s create an advanced visualization using seaborn:

Exercise 5: Create a Complex Visualization

Create a violin plot showing the distribution of body mass by species, with separate plots for each sex:
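One way to get a separate panel per sex is seaborn's `catplot` with `kind="violin"` and `col="sex"`. The sketch below is a minimal version under stated assumptions: it uses randomly generated stand-in data in a DataFrame named `toy_penguins` (hypothetical, not real measurements) and the column names `species`, `sex`, and `body_mass_g`; in the lab you would pass `penguins` instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: random values, not real measurements.
rng = np.random.default_rng(0)
rows = []
for species, center in [("Adelie", 3700), ("Gentoo", 5100)]:
    for sex, shift in [("female", -300), ("male", 300)]:
        for mass in rng.normal(center + shift, 250, size=20):
            rows.append({"species": species, "sex": sex, "body_mass_g": mass})
toy_penguins = pd.DataFrame(rows)

# kind="violin" draws the distributions; col="sex" creates one panel per sex.
g = sns.catplot(data=toy_penguins, x="species", y="body_mass_g",
                col="sex", kind="violin")
g.set_axis_labels("Species", "Body mass (g)")
```

An alternative design is a single panel with `hue="sex"`, which places the two sexes side by side within each species and makes direct comparison easier at the cost of a busier plot.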

Final Challenge

Exercise 6: Data Analysis Project

Choose either R or Python to complete the following analysis:

Filter the data to include only complete cases (no missing values)
Calculate the average measurements for each species-sex combination
Create a visualization showing these averages
Add error bars showing standard error
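The first two steps can be sketched in pandas as follows. This is a minimal illustration on a hand-made DataFrame named `toy_penguins` (hypothetical, with made-up values), and it assumes the column names `species`, `sex`, `flipper_length_mm`, and `body_mass_g`. It relies on the built-in `"sem"` aggregation, which computes the standard error of the mean (sample standard deviation divided by the square root of the group size).

```python
import pandas as pd

# Hypothetical toy rows (made-up values) with one incomplete record.
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["male", "male", "female", "female", "female", None],
    "flipper_length_mm": [190.0, 194.0, 185.0, 210.0, 214.0, 220.0],
    "body_mass_g": [4000.0, 4200.0, 3400.0, 4600.0, 4800.0, 5500.0],
})

# Step 1: keep complete cases only (drops the row with a missing sex).
complete = toy_penguins.dropna()

# Step 2: mean and standard error for each species-sex combination.
measurements = ["flipper_length_mm", "body_mass_g"]
averages = complete.groupby(["species", "sex"])[measurements].agg(["mean", "sem"])
print(averages)
```

A group with a single observation has an undefined standard error, which pandas reports as `NaN`; such groups need to be handled or noted before drawing error bars.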

Use the code block below and refer to previous examples for guidance:

R Version

library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data

# Keep complete cases, summarise each species-sex group, then plot
penguins |>
  drop_na() |>
  group_by(species, sex) |>
  summarise(
    mean_flipper = mean(flipper_length_mm),
    se_flipper = sd(flipper_length_mm) / sqrt(n()),
    .groups = "drop"
  ) |>
  ggplot(aes(x = species, y = mean_flipper, fill = sex)) +
    geom_col(position = "dodge") +
    geom_errorbar(aes(ymin = mean_flipper - se_flipper,
                      ymax = mean_flipper + se_flipper),
                  position = position_dodge(0.9),
                  width = 0.25) +
    theme_minimal() +
    labs(title = "Average Flipper Length by Species and Sex",
         x = "Species",
         y = "Flipper Length (mm)")

Python Version

import matplotlib.pyplot as plt
import numpy as np

# Keep complete cases, then compute mean and standard error per group
stats_df = (penguins.dropna()
            .groupby(['species', 'sex'])['flipper_length_mm']
            .agg(['mean', 'std', 'count']).reset_index())
stats_df['se'] = stats_df['std'] / np.sqrt(stats_df['count'])

# Create grouped bar plot; rows are sorted by species, so bars line up with species_list
plt.figure(figsize=(10, 6))
species_list = stats_df['species'].unique()
x = np.arange(len(species_list))
width = 0.35
male = stats_df[stats_df['sex'] == 'male']
female = stats_df[stats_df['sex'] == 'female']

plt.bar(x - width/2, male['mean'], width, label='Male', yerr=male['se'], capsize=4)
plt.bar(x + width/2, female['mean'], width, label='Female', yerr=female['se'], capsize=4)

plt.xlabel('Species')
plt.ylabel('Flipper Length (mm)')
plt.title('Average Flipper Length by Species and Sex')
plt.xticks(x, species_list)
plt.legend()
plt.show()

Conclusion

In this lab, we’ve:

Explored the Palmer Penguins dataset using both R and Python
Created various types of visualizations
Performed basic statistical analyses
Learned about data manipulation techniques

For more information about the dataset, visit the Palmer Penguins website.