AI Genomics for Every Researcher

Protagona partnered with a nationally recognized pediatric cancer nonprofit to build a conversational AI platform that generates ready-to-run Jupyter notebooks from plain-language research requests — eliminating cloud expertise as a barrier to genomic discovery.

Industry

Nonprofit

Teams & Services

Cloud Architecture, AI/ML Engineering, Back-End, DevOps, Infrastructure as Code

Tech & Tools

Amazon Bedrock, Amazon SageMaker Studio, Amazon ECS on AWS Fargate, AWS Lambda, Amazon S3, Amazon DynamoDB, AWS CodeBuild, Amazon ECR, Amazon EFS, Application Load Balancer, Claude 3.5 Sonnet, Streamlit, Terraform, Python, Jupyter Notebooks

Key Data Points

Researchers generate complete, executable Jupyter notebooks for single-cell genomics analysis through a chat interface, with no cloud infrastructure knowledge required.

Each session receives a fully isolated, automatically provisioned SageMaker Studio workspace. Dataset access is embedded in the notebook as presigned URLs valid for 7 days, so researchers need no AWS credentials.

Idle SageMaker resources are automatically reclaimed after 30 minutes of inactivity, keeping infrastructure lean without any manual intervention.

The Vision

A nationally recognized pediatric cancer nonprofit maintains one of the most comprehensive open repositories of pediatric cancer genomic data in the world. Its dedicated research data team's mission is to accelerate discovery by making that data as accessible and actionable as possible. Leadership understood that the value of the open-source atlas depended not just on data quality, but on how easily external researchers could actually analyze it. With a lean, donor-supported operation serving institutions with widely varying technical capabilities, the nonprofit set out to remove the cloud and coding barriers standing between researchers and scientific breakthroughs, and sought a technical partner who could turn that ambition into a working system.

The Goal

The goal was to deliver a proof-of-concept application enabling researchers to interact with the organization's genomic datasets through natural language, automatically generating ready-to-run Jupyter notebooks tailored to their specific analysis goals. The system needed to serve researchers without cloud expertise, provision compute environments on demand, and clean up resources automatically — keeping operations manageable for a resource-constrained nonprofit operating at the frontier of pediatric cancer research.

The Challenge

The core challenge was bridging a significant gap between sophisticated genomic datasets and a research community with uneven cloud and coding proficiency. Asking researchers to manually configure SageMaker environments, write single-cell analysis code, and manage AWS credentials would exclude a large portion of the intended audience and slow scientific progress across institutions.

Building a system that dynamically provisions personalized cloud environments for an unpredictable number of concurrent researchers introduced real architectural complexity. Each session required full isolation to prevent data or workspace collisions, yet infrastructure had to scale without manual oversight and clean itself up to avoid runaway resource consumption. Connecting a conversational AI layer to live cloud infrastructure — while ensuring generated notebooks were syntactically valid, immediately executable, and contextually aware of the specific dataset selected — required careful orchestration across multiple AWS services. There was no margin for a broken researcher experience: a single failed notebook or misconfigured environment would erode trust in the platform and set back the nonprofit's broader accessibility mission.

The Solution

Protagona designed a containerized, event-driven application hosted on Amazon ECS with AWS Fargate, fronted by an Application Load Balancer, that presents researchers with a Streamlit chat interface. Researchers select a dataset from the organization's repository and describe their desired analysis in plain language. Amazon Bedrock, running Claude 3.5 Sonnet, processes each request using a structured tool-calling approach that enforces valid Jupyter notebook format (nbformat v4), so each generated notebook is well-formed and ready to run without manual correction.
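
The structured tool-calling step described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the tool name `create_notebook`, its schema, and the `build_notebook` helper are assumptions made for the example. The tool definition follows the shape the Bedrock Converse API expects for a tool entry, and the helper turns the model's structured output into an nbformat v4 document using only the standard library.

```python
import json

# Hypothetical tool definition (name and schema are assumptions), written in
# the shape the Bedrock Converse API expects under toolConfig["tools"].
NOTEBOOK_TOOL = {
    "toolSpec": {
        "name": "create_notebook",
        "description": "Return the analysis as an ordered list of notebook cells.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "cells": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "cell_type": {"type": "string", "enum": ["markdown", "code"]},
                                "source": {"type": "string"},
                            },
                            "required": ["cell_type", "source"],
                        },
                    }
                },
                "required": ["cells"],
            }
        },
    }
}


def build_notebook(tool_input: dict) -> dict:
    """Convert the model's tool input into an nbformat v4 notebook dict."""
    cells = []
    for index, cell in enumerate(tool_input["cells"]):
        cell_type = cell["cell_type"]
        if cell_type not in ("markdown", "code"):
            raise ValueError(f"unsupported cell type: {cell_type!r}")
        entry = {
            "id": f"cell-{index}",  # cell ids are part of nbformat 4.5
            "cell_type": cell_type,
            "metadata": {},
            "source": cell["source"],
        }
        if cell_type == "code":
            # Code cells additionally carry an execution count and outputs.
            entry["execution_count"] = None
            entry["outputs"] = []
        cells.append(entry)
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 5}


# Illustrative tool input, standing in for what the model would return.
example_input = {
    "cells": [
        {"cell_type": "markdown", "source": "# Quality control"},
        {"cell_type": "code", "source": "print('analysis goes here')"},
    ]
}
notebook = build_notebook(example_input)
notebook_json = json.dumps(notebook, indent=1)  # contents of the .ipynb file
```

Because the model fills in a constrained schema instead of emitting raw notebook JSON, the application controls the final file structure, which is one plausible way to guarantee a well-formed notebook on every request.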

To give each researcher a private, fully managed compute environment, the system dynamically provisions a dedicated SageMaker Studio user profile and JupyterLab space per session. Notebooks are stored in Amazon S3 and synchronized bidirectionally with the Amazon EFS file system used by SageMaker Studio via Lambda, so changes made inside Studio are reflected back to the AI layer for iterative refinement. Researchers receive a direct deep link into their personal Studio space, and dataset access is embedded in notebook cells as presigned URLs with a 7-day window, which removes any need to configure AWS credentials. Session state is tracked in DynamoDB with a 24-hour TTL.
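
As a rough illustration of the credential-free access and session tracking described above, the sketch below builds a notebook cell around a presigned URL and a DynamoDB item with an expiry timestamp. The function names, the attribute names (`session_id`, `dataset_key`, `expires_at`), and the download approach inside the cell are all assumptions for the example. The S3 client is passed in as a parameter; in practice it would likely be `boto3.client("s3")`, whose `generate_presigned_url` method is called here with its documented arguments.

```python
import time
from typing import Optional

SEVEN_DAYS = 7 * 24 * 60 * 60  # presigned URL lifetime, in seconds
ONE_DAY = 24 * 60 * 60         # session record lifetime, in seconds


def dataset_cell_source(s3_client, bucket: str, key: str) -> str:
    """Return code for a notebook cell that downloads a dataset by presigned URL."""
    url = s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=SEVEN_DAYS,
    )
    filename = key.rsplit("/", 1)[-1]
    # The generated cell needs only the URL, not AWS credentials.
    return (
        "import urllib.request\n"
        f"urllib.request.urlretrieve({url!r}, {filename!r})\n"
    )


def session_item(session_id: str, dataset_key: str, now: Optional[float] = None) -> dict:
    """Build a DynamoDB item whose expires_at attribute is 24 hours from now.

    DynamoDB TTL expects a Number attribute holding epoch seconds; the
    attribute name (expires_at here, an assumption) is configured on the table.
    """
    now = time.time() if now is None else now
    return {
        "session_id": {"S": session_id},
        "dataset_key": {"S": dataset_key},
        "expires_at": {"N": str(int(now) + ONE_DAY)},
    }
```

The item returned by `session_item` is in the low-level attribute-value format accepted by the DynamoDB client's `put_item` call. One design point worth keeping in mind: a presigned URL stops working once the credentials that signed it expire, so a full 7-day window assumes the signing credentials remain valid that long.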

Infrastructure lifecycle management was built in from the start. A Lambda janitor function monitors session activity and automatically removes SageMaker spaces and user profiles that have been idle for 30 minutes. The entire stack is defined in Terraform with workspace-based multi-user deployment support, and a CodeBuild pipeline handles zero-downtime container updates, so the engineering team can evolve the application without managing Docker builds or manual deployments.
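
The idle-cleanup logic described above might look something like the sketch below. It is a simplified illustration under stated assumptions: the session record fields (`session_id`, `space_name`, `user_profile_name`, `last_activity`) and the function names are hypothetical, and the SageMaker client is passed in as a parameter (in practice, likely `boto3.client("sagemaker")`). Deleting any running apps in a space first, and waiting for each deletion to finish, are omitted for brevity.

```python
from datetime import datetime, timedelta, timezone

IDLE_LIMIT = timedelta(minutes=30)


def find_idle_sessions(sessions, now, idle_limit=IDLE_LIMIT):
    """Return the sessions whose last activity is at least idle_limit old."""
    return [s for s in sessions if now - s["last_activity"] >= idle_limit]


def reclaim(sagemaker_client, domain_id, session):
    """Delete the session's space first, then the user profile that owns it."""
    sagemaker_client.delete_space(
        DomainId=domain_id, SpaceName=session["space_name"]
    )
    sagemaker_client.delete_user_profile(
        DomainId=domain_id, UserProfileName=session["user_profile_name"]
    )


def run_janitor(sagemaker_client, domain_id, sessions, now=None):
    """Reclaim every idle session and return the ids that were cleaned up."""
    now = now or datetime.now(timezone.utc)
    idle = find_idle_sessions(sessions, now)
    for session in idle:
        reclaim(sagemaker_client, domain_id, session)
    return [s["session_id"] for s in idle]
```

Keeping the idle check as a pure function separate from the deletion calls makes the 30-minute rule easy to test without touching AWS, and a scheduled Lambda handler would only need to load session records and call `run_janitor`.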

Natural Language Genomics Access

Researchers at institutions with no cloud expertise can now generate complete, executable single-cell genomics notebooks through a plain-language chat interface — removing the technical gatekeeping that previously limited who could engage with the dataset.

Zero-Credential, Isolated Workspaces

Every session receives a fully isolated SageMaker Studio environment with dataset access embedded as presigned URLs, eliminating credential management and shared-environment conflicts across concurrent researcher sessions.

Hands-Free Infrastructure Lifecycle

Automated Lambda-driven cleanup reclaims idle SageMaker resources after 30 minutes, while a Terraform-managed CodeBuild pipeline enables zero-downtime updates — keeping operational overhead proportional to a nonprofit's lean team.