Where to find practice datasets
Five data repositories to boost your practice
6/2/2025, 2 min read
Discover real datasets to enhance your R learning journey
When learning R, it's good to use "real" datasets to apply what you've learned. It makes the learning more interesting and more engaging: look at space data if you're an astronomy enthusiast, movie data if you're a film buff, or even flower data if you're a keen gardener. Here are some go-to data repositories that will give you plenty of data to explore and practice new functions on.
1. Built-in R Datasets
Run data() in R to see which datasets are built in. They cover a wide variety of topics, from sunspot data to admissions to UC Berkeley, and they include time series as well as static data, so there's plenty to analyze.
You often see these datasets being used to demonstrate the use of functions. The "iris" dataset seems to be used frequently, but I think it's a pretty dull dataset compared to what's available elsewhere!
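If you'd like to poke at the built-in datasets from the console, a minimal sketch might look like the one below. It assumes the standard `datasets` package that ships with R, and it picks `sunspots` and `UCBAdmissions` purely as examples of the sunspot and UC Berkeley data mentioned above.

```r
# List the datasets that ship with the base "datasets" package
builtin <- data(package = "datasets")
head(builtin$results[, c("Item", "Title")])

# Sunspot data: a monthly time series
class(sunspots)
head(sunspots)

# UC Berkeley admissions: a 3-dimensional contingency table
# (admission outcome x gender x department)
dim(UCBAdmissions)
head(as.data.frame(UCBAdmissions))
```

Typing `?sunspots` or `?UCBAdmissions` opens the help page for each dataset, which explains where the data came from and what the variables mean.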
2. Tidy Tuesday
Website: https://github.com/rfordatascience/tidytuesday
Every Tuesday a dataset is released with a different topic. It's a heavily used data repository and you can find lots of YouTube tutorials to follow along as people go through some of the datasets.
For actuaries: If you want to practice some survival analysis, then have a look at the "Alone data" from January 2023. For practice on life expectancy calculations and modeling, check out the "Life Expectancy" dataset from December 2023.
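One way to get a Tidy Tuesday dataset into R is to read the CSV straight from the GitHub repository. The sketch below is only an illustration: `tt_csv_url()` is a hypothetical helper, and the branch name (`main`), the release date (`2023-01-24`) and the file name (`survivalists.csv`) are assumptions, so check them against that week's README in the repository before relying on them.

```r
# Hypothetical helper: build the raw GitHub URL for a Tidy Tuesday CSV.
# Assumes the repository keeps files under data/<year>/<date>/<file>.
tt_csv_url <- function(date, file, branch = "main") {
  year <- substr(date, 1, 4)
  sprintf(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/%s/data/%s/%s/%s",
    branch, year, date, file
  )
}

# Assumed date and file name for the "Alone" data
alone_url <- tt_csv_url("2023-01-24", "survivalists.csv")
alone_url

# Reading the file needs an internet connection,
# so this part only runs in an interactive session
if (interactive()) {
  survivalists <- read.csv(alone_url)
  str(survivalists)
}
```

If you prefer not to build URLs by hand, the tidytuesdayR package provides `tt_load()`, which takes the release date and downloads that week's files for you.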
3. Kaggle Datasets
Website: https://www.kaggle.com/datasets
Kaggle hosts thousands of datasets that are regularly updated by the community. You'll find everything from government data and sports statistics to medical research and financial data. Many datasets come with code examples and notebooks to help you get started, making it perfect for R practice.
4. Awesome Public Datasets
Website: https://github.com/awesomedata/awesome-public-datasets
The title pretty much says everything! This repository is a curated list of links to public datasets, organized by topic. It's a comprehensive collection that covers virtually every domain you can think of.
5. Google Dataset Search
Website: https://datasetsearch.research.google.com/
Finally... if none of these pique your interest, then you can always use Google Dataset Search to find something else. It's like Google, but specifically for finding datasets across the web.
Happy coding and data exploring! 📊📈
Additional Tips for Using These Repositories:
Start small: Begin with datasets that have fewer than 10,000 rows while learning
Read the documentation: Most repositories provide data dictionaries explaining what each column means
Join communities: Many of these platforms have active communities where you can ask questions
Practice regularly: Try to work with a new dataset at least once a week
Share your work: Post your analyses on platforms like GitHub or LinkedIn to build your portfolio
Why Real Data Matters:
Working with real datasets is crucial for developing practical R skills because:
Messy data: Real data often needs cleaning, which teaches you important data wrangling skills
Context matters: Understanding the domain helps you ask better analytical questions
Practical constraints: Real datasets teach you to work within limitations and make assumptions
Portfolio building: Real analyses are more impressive to potential employers than textbook examples
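To give a flavor of the "messy data" point, here is a small sketch of a first cleaning pass. It uses the built-in `airquality` dataset only because it needs no download and contains missing values. Dropping incomplete rows is just one possible choice here, not a recommendation for every analysis.

```r
# airquality ships with R and contains missing values
data(airquality)

# Count the missing values in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Keep only the rows that have no missing values
airquality_complete <- airquality[complete.cases(airquality), ]

nrow(airquality)           # rows before
nrow(airquality_complete)  # rows after dropping incomplete rows
```

Comparing the two row counts shows how much data a simple "drop the gaps" rule throws away, which is exactly the kind of trade-off real datasets force you to think about.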
The best way to learn R is by doing. Pick a dataset that interests you, start exploring, and don't be afraid to make mistakes – that's how you move forward.