
Proving the "Impossible" Possible: AI Document Processing for Congressional Filings
Protagona partnered with a leading nonpartisan money-in-politics organization to build an AI-powered document processing pipeline on AWS, automating extraction from the most hostile congressional financial disclosures — including handwritten, degraded, and multi-hundred-page filings.
Industry
Nonprofit
Teams & Services
Data Engineering, Cloud Architecture, AI/ML, Delivery Management
Tech & Tools
Amazon Bedrock, Amazon Textract, AWS Lambda, Amazon S3, AWS Step Functions, Amazon Bedrock Agents, Multi-modal Foundation Models
Key Data Points
The Vision
One of the most trusted nonpartisan sources of money-in-politics data in the United States, this organization tracks campaign contributions, lobbying activity, and the personal financial disclosures of elected officials. Their credibility depends entirely on the accuracy of what they publish. For years, processing mandatory congressional financial disclosures required dedicated researchers working manually through documents — some hundreds of pages long, some handwritten and deliberately degraded through repeated photocopying. They needed a partner who could prove intelligent document processing was viable before committing to a full build.
The Goal
Protagona was engaged to achieve three concrete objectives: prove that an AI-powered pipeline could extract structured financial data from the full range of congressional disclosure formats, including the most difficult handwritten and degraded documents; implement confidence scoring that routes uncertain extractions to human review before they reach the public dataset; and deliver a working proof of concept deployed inside the organization's own AWS environment within three weeks.
The Challenge
Congressional financial disclosures are among the most hostile documents for automated processing. Filings range from clean machine-typed forms to handwritten submissions photocopied until the text becomes ambiguous. Some arrive upside down, and others are buried inside brokerage statement attachments from dozens of financial institutions, each with its own format. Some exceed three hundred pages. The organization's own technical leadership had previously attempted extraction with available tools and concluded it could not be done reliably. That skepticism defined the engagement's starting point. The accuracy bar was unambiguous: data published for journalists, researchers, and the public is treated as factual record, and any error reaching the platform would damage institutional credibility built over decades.
The Solution
Protagona designed an intelligent document processing pipeline that automatically extracts financial data the moment a filing is uploaded. A coordinating AI agent breaks the processing of each document into three stages (extraction, validation, and confidence assessment) and routes each stage to the right tool for the job. Standard text and structure are extracted directly, while handwritten entries and degraded scans, including documents blurred through repeated photocopying, are processed using AI models built to interpret visual content beyond what traditional text recognition can handle.
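To make the routing idea concrete, here is a minimal Python sketch of how pages might be dispatched to different extractors. It is illustrative only and not the delivered implementation: the page labels (`typed`, `handwritten`, `degraded`), the function names, and the fallback behavior are all assumptions. Both extractors are placeholders for real service calls, for example a text-recognition service for clean pages and a multi-modal model for difficult ones.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    number: int
    kind: str  # hypothetical labels: "typed", "handwritten", "degraded"


def extract_text_and_structure(page: Page) -> dict:
    # Placeholder for a conventional text/form extraction call.
    return {"page": page.number, "method": "text_extraction"}


def extract_with_vision_model(page: Page) -> dict:
    # Placeholder for a multi-modal model call that reads the page image.
    return {"page": page.number, "method": "vision_model"}


# Hypothetical routing table: page kind -> extractor.
ROUTES: Dict[str, Callable[[Page], dict]] = {
    "typed": extract_text_and_structure,
    "handwritten": extract_with_vision_model,
    "degraded": extract_with_vision_model,
}


def process_document(pages: List[Page]) -> List[dict]:
    """Send each page to the extractor registered for its kind."""
    results = []
    for page in pages:
        # Assumption: unrecognized page kinds fall back to the vision model.
        extractor = ROUTES.get(page.kind, extract_with_vision_model)
        results.append(extractor(page))
    return results
```

Keeping the routing in a lookup table means a new document type can be added without changing the processing loop.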
The confidence-scoring system was the strategic centerpiece of the design. Rather than treating every extraction the same way, the pipeline scores each data point individually — high-confidence results move straight through automated processing, while anything uncertain is flagged for human review before it reaches the public dataset. This preserves the accuracy the organization depends on without requiring every filing to be reviewed from scratch. The full system was delivered with complete documentation, so the internal team can operate and extend it independently.
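The per-field routing described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production logic: the `ExtractedField` type, the `route_fields` helper, and the 0.90 threshold are hypothetical, and the case study does not state the actual cutoff or how scores are computed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cutoff; the real threshold is not given in the case study.
REVIEW_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route_fields(
    fields: List[ExtractedField], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into auto-accepted results and a human-review queue."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            # Anything below the threshold is held back from publication.
            needs_review.append(field)
    return accepted, needs_review
```

Because each data point is scored on its own, a single uncertain entry on a page sends only that entry to a reviewer, not the whole filing.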
