Building an ETL Pipeline with AWS Glue
Walkthrough demo: generating synthetic data for an ML pipeline
In this blog post, we walk through a demo of building a simple but realistic data pipeline using AWS Glue and SageMaker, along with a few other basic AWS services.
In this demo, we simulate a bioinformatics research scenario in which researchers observe the development of zebrafish over time to understand the early-life indicators of a specific health condition. Each fish is monitored through imaging and metadata collection during its early life, and we later observe whether or not it developed the condition.
The goal is to develop a machine learning pipeline that predicts the likelihood of a zebrafish developing a specific biological condition based on early-life imaging and associated metadata.
This simulates a real-world bioinformatics research project with the potential for large-scale data ingestion, long-term studies, and multimodal data fusion, which makes it a good candidate for a scalable, cloud-native MLOps workflow.
To keep the demo cost-effective, we’ll use synthetic data and generate placeholder images via a Python script instead of using real biological datasets.
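As a rough illustration of what such a generator script might look like, here is a minimal sketch using only the Python standard library. It is not the script used in the demo: the column names (`fish_id`, `age_hours`, `length_mm`, `water_temp_c`, `developed_condition`), the value ranges, the 30% condition rate, and the choice of PGM as the image format are all assumptions made for the example.

```python
import csv
import random
import tempfile
from pathlib import Path

# Hypothetical schema: these column names and ranges are illustrative only.
FIELDS = ["fish_id", "age_hours", "length_mm", "water_temp_c", "developed_condition"]


def generate_metadata(n_fish, seed=0):
    """Return one synthetic metadata row (a dict) per fish."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows = []
    for i in range(n_fish):
        rows.append({
            "fish_id": f"fish_{i:04d}",
            "age_hours": rng.randint(24, 120),
            "length_mm": round(rng.uniform(3.0, 5.0), 2),
            "water_temp_c": round(rng.uniform(26.0, 29.0), 1),
            # Binary label; the 30% positive rate is an arbitrary assumption.
            "developed_condition": int(rng.random() < 0.3),
        })
    return rows


def placeholder_image(width=64, height=64, seed=0):
    """Return random grayscale noise encoded as binary PGM (P5) bytes."""
    rng = random.Random(seed)
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    pixels = bytes(rng.randrange(256) for _ in range(width * height))
    return header + pixels


def write_dataset(out_dir, n_fish=10, seed=0):
    """Write metadata.csv plus one placeholder image per fish under out_dir."""
    out = Path(out_dir)
    (out / "images").mkdir(parents=True, exist_ok=True)
    rows = generate_metadata(n_fish, seed)
    with open(out / "metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    for i, row in enumerate(rows):
        image_path = out / "images" / f"{row['fish_id']}.pgm"
        image_path.write_bytes(placeholder_image(seed=seed + i))
    return out / "metadata.csv"


if __name__ == "__main__":
    csv_path = write_dataset(tempfile.mkdtemp(prefix="zebrafish_demo_"), n_fish=5)
    print(f"Wrote synthetic dataset to {csv_path.parent}")
```

PGM is used here only because it can be written without third-party packages. A script closer to a real imaging workflow would more likely emit PNG or TIFF files through an imaging library, and the resulting folder would then be uploaded to storage for the pipeline to ingest.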