Building a Portfolio for Data Engineering
Practical projects for when you're just starting out.
17th April 2024 | Richard Honour
Are you looking to bolster your data engineering skills and showcase your abilities to potential employers?
Settling on the right projects to suitably demonstrate your skills can be difficult.
This is especially true if you are just starting out and may not have 100% confidence in your abilities just yet.
In this guide, we'll provide you with practical project suggestions along with step-by-step instructions to help you get started on putting together a portfolio.
We also understand how difficult it can be to see a project through to the end. For that reason, projects marked with the Quick Project tag are deliberately short in scope and should be easier to complete in one go.
ETL Microservice
This project involves building a live-updating microservice that extracts raw data from a public API, processes it, stores it, transforms it into a usable format, and finally, makes it available for analysis.
Completing this project demonstrates your proficiency in backend development, working with APIs, Python, Pandas, ETL/ELT processes, cloud platforms, and data analysis.
Suggestion by artfully_rearranged
🧠 Prerequisites
You will need a good understanding of the following:
Python
Flask
Pandas
You will need access to the following:
Cloud Provider account (GCP, AWS, Azure etc.)
GitHub account
🪜 Project Guide: ETL Microservice
Choose a Public API
Find a suitable public API that provides the data you're interested in.
Ensure that it's free to access.
Set up a Cloud Environment
Create an account with a cloud provider and set up a virtual machine (VM) instance where your microservice will run.
Note down the VM's external IP address.
Write a Python Microservice with Flask
Develop a Flask application and create a route to fetch data from the chosen API.
Use the requests library to ingest the data, and print the JSON response so you can inspect it while debugging.
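As a rough illustration of this step, here is a minimal sketch of a Flask app with a single route that calls an API through requests and prints the JSON for debugging. The URL, the `/fetch` route name, and the `fetch_raw` and `create_app` helpers are all assumptions made for the example; swap in the details of the API you chose.

```python
import requests
from flask import Flask, jsonify

# Hypothetical endpoint: replace with the public API you chose.
API_URL = "https://api.example.com/v1/readings"


def fetch_raw(url=API_URL, timeout=10):
    """Call the API and return the decoded JSON payload."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def create_app(fetcher=fetch_raw):
    """Build the Flask app; `fetcher` can be swapped out for offline testing."""
    app = Flask(__name__)

    @app.route("/fetch")
    def fetch():
        payload = fetcher()
        print(payload)  # debugging aid: inspect the raw JSON in the console
        return jsonify(payload)

    return app
```

To try it locally, call `create_app().run()` and open `/fetch` in a browser. Passing the fetcher in as an argument is a design choice for this sketch: it lets you exercise the route with canned data before pointing it at a real API.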
Store the Data
Set up a database or data lake (e.g., Google Cloud Storage, Amazon S3, or a SQL database) and modify your Flask app to store the raw data into it.
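One possible shape for the storage step, sketched with SQLite from the Python standard library as a stand-in for whichever database or data lake you pick. The `raw_responses` table and the `init_db` and `store_raw` helpers are hypothetical names for this example. Each payload is kept as an unmodified JSON string alongside a fetch timestamp, so the raw data stays available for later transformation.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create the raw-data table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS raw_responses (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            fetched_at TEXT NOT NULL,
            payload TEXT NOT NULL
        )
        """
    )
    conn.commit()


def store_raw(conn, payload):
    """Insert one API payload as a JSON string and return its row id."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "INSERT INTO raw_responses (fetched_at, payload) VALUES (?, ?)",
        (fetched_at, json.dumps(payload)),
    )
    conn.commit()
    return cursor.lastrowid
```

In the Flask app, the route would call something like `store_raw(conn, payload)` right after fetching. With a cloud bucket instead of SQL, the same idea applies: write one timestamped JSON object per fetch.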
Transform the Data with Pandas
Extract data from the database/lake, use Pandas to transform it into a clean, tabular format, and verify the results.
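The transformation might look something like the sketch below. The record layout (`city`, `observed_at`, and a nested `main.temp` value) is invented for illustration, as is the `to_table` helper; your API will have its own fields. The example flattens nested JSON, parses timestamps, drops incomplete rows, and removes duplicates.

```python
import pandas as pd


def to_table(records):
    """Flatten nested JSON records into a clean, de-duplicated DataFrame."""
    df = pd.json_normalize(records)  # nested keys become dotted column names
    df = df.rename(columns={"main.temp": "temp_c"})
    df["observed_at"] = pd.to_datetime(df["observed_at"], utc=True)
    df = df.dropna(subset=["temp_c"])  # discard rows with no reading
    df = df.drop_duplicates(subset=["city", "observed_at"])
    return df.sort_values("observed_at").reset_index(drop=True)


# Hypothetical sample resembling what a weather-style API might return.
sample = [
    {"city": "Leeds", "observed_at": "2024-04-17T10:00:00Z", "main": {"temp": 11.5}},
    {"city": "Leeds", "observed_at": "2024-04-17T10:00:00Z", "main": {"temp": 11.5}},
    {"city": "York", "observed_at": "2024-04-17T09:00:00Z", "main": {"temp": None}},
    {"city": "York", "observed_at": "2024-04-17T11:00:00Z", "main": {"temp": 9.0}},
]
table = to_table(sample)
```

To verify the results, check the row count, the column dtypes (`table.dtypes`), and a few known values against the raw payloads.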
Host your Microservice on the Cloud
Deploy your Flask microservice on the cloud VM and ensure it fetches updated data at regular intervals.
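The guide does not prescribe how to schedule the regular fetch. One simple option, sketched here with only the standard library, is a loop that runs a job, waits, and repeats until asked to stop. The `run_every` helper and its `max_runs` parameter are assumptions for this example; a cron job or your cloud provider's scheduler would be equally reasonable choices.

```python
import threading


def run_every(interval_seconds, job, stop_event, max_runs=None):
    """Call `job` repeatedly, waiting `interval_seconds` between calls.

    Returns the number of times the job was attempted.
    """
    runs = 0
    while not stop_event.is_set():
        try:
            job()
        except Exception as exc:  # keep the loop alive if one fetch fails
            print(f"job failed: {exc}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        stop_event.wait(interval_seconds)  # returns early if stop is requested
    return runs
```

On the VM, this could run in a background thread, for example `threading.Thread(target=run_every, args=(300, fetch_and_store, stop_event), daemon=True).start()`, where `fetch_and_store` is a placeholder for your own fetch-and-save function.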
Host your Code on GitHub
Create a GitHub repository for your project, commit and push your code, and provide comprehensive documentation in the README.
Export for Data Analysis (Optional)
Choose a destination for exporting data (e.g., Google Sheets, BigQuery) and modify your Flask app to export transformed data for analysis.
Basic ETL Script
Quick Project
This project involves extracting and transforming data from an API.
Completing this project demonstrates your proficiency in...
Suggestion by miscbits
🧠 Prerequisites
You will need a good understanding of the following:
Python
Flask
Pandas
You will need access to the following:
Cloud Provider account (GCP, AWS, Azure etc.)
GitHub account
🪜 Project Guide: Basic ETL Script
Choose a Public API
Find a suitable public API that provides the data you're interested in.
Ensure that it's free to access.
Set up a Cloud Environment
Create an account with a cloud provider and set up a virtual machine (VM) instance where your microservice will run.
Note down the VM's external IP address.
Write a Python Microservice with Flask
Develop a Flask application and create a route to fetch data from the chosen API.
Use the requests library to ingest the data, and print the JSON response so you can inspect it while debugging.
Store the Data
Set up a database or data lake (e.g., Google Cloud Storage, Amazon S3, or a SQL database) and modify your Flask app to store the raw data into it.
Transform the Data with Pandas
Extract data from the database/lake, use Pandas to transform it into a clean, tabular format, and verify the results.
Host your Microservice on the Cloud
Deploy your Flask microservice on the cloud VM and ensure it fetches updated data at regular intervals.
Host your Code on GitHub
Create a GitHub repository for your project, commit and push your code, and provide comprehensive documentation in the README.
Export for Data Analysis (Optional)
Choose a destination for exporting data (e.g., Google Sheets, BigQuery) and modify your Flask app to export transformed data for analysis.
Basic Web Scraper
Quick Project
This project involves writing a Python script that scrapes data from a website using libraries such as BeautifulSoup and Requests.
Completing this project demonstrates your proficiency in...
Suggestion by miscbits
🧠 Before you start...
You will need a good understanding of the following:
Python
Flask
Pandas
You will need access to the following:
Cloud Provider account (GCP, AWS, Azure etc.)
GitHub account
🪜 Project Guide
Store the scraped data in a JSON or CSV file.
Find...
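For the storage step, a minimal sketch using the standard library's json and csv modules could look like this. The `save_json` and `save_csv` helpers are hypothetical, and `records` stands for whatever list of dictionaries your scraper produces.

```python
import csv
import json
from pathlib import Path


def save_json(records, path):
    """Write the scraped records to a JSON file."""
    text = json.dumps(records, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def save_csv(records, path):
    """Write the scraped records to a CSV file with a header row."""
    # Collect every key in first-seen order so rows with missing
    # fields still fit the header; missing values are left blank.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```

JSON preserves nesting and types, which suits irregular pages. CSV is easier to open in a spreadsheet or load into a database, but only works well when every record is flat.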
Projects for the Easily Distracted
Suggestion by miscbits
Description:
These project ideas are tailored for individuals who may have limited time or attention spans. They are short in scope but still provide valuable experience in typical data engineering tasks.
Project Suggestions:
Extract and Transform Data from an API
ETL from CSV to a Database
Simple Web Scraper
Log Data Analysis
Social Media Data Collector
Data Validation and Cleanup
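To give a sense of scale, here is a rough sketch of the second idea, ETL from CSV to a database, using only the standard library. The `sales` table, its `region` and `amount` columns, and the `load_csv` helper are invented for the example. It reads rows, tidies the text, skips rows with a non-numeric amount, and loads the rest into SQLite.

```python
import csv
import sqlite3


def load_csv(conn, csv_file):
    """Read rows from an open CSV file, clean them, and insert them into SQLite.

    Returns the number of rows loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = []
    for row in csv.DictReader(csv_file):
        region = row["region"].strip().title()  # normalise spacing and case
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # skip rows whose amount is missing or not numeric
        rows.append((region, amount))
    conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

A script this size already covers extraction, a small validation rule, and loading, which is enough to discuss in a README or an interview.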
Benefits:
These projects are perfect for building a diverse portfolio while demonstrating your ability to handle various data engineering tasks efficiently.
By working on these projects, not only will you enhance your technical skills, but you'll also create tangible evidence of your expertise that can impress potential employers. So, roll up your sleeves and start building your data engineering portfolio today!