Roles & Stack
Impact
- Automated extraction of 4+ clinical data fields per transcription
- Structured output with ICD-10 code matching
- JSON schema validation for data integrity
- Batch processing capability for large datasets
Year
2025
Category
AI & Machine Learning
Deliverables
- Python Data Extraction Pipeline
- OpenAI API Integration
- Structured JSON Output Schema
- Jupyter Notebook Documentation
The Situation: Unstructured Clinical Data
Healthcare organizations generate massive volumes of clinical transcriptions daily. These free-form text documents contain valuable patient information—diagnoses, treatments, demographics—but extracting this data manually is expensive, slow, and error-prone.
I was tasked with building an automated pipeline that could parse raw transcriptions and output structured, validated JSON suitable for analytics and billing systems. The key challenge: medical language is nuanced, full of abbreviations, and requires domain knowledge to interpret correctly.
The Problem: Manual Extraction Doesn't Scale
Traditional approaches involve either human coders who manually review each transcription or rigid rule-based systems that fail on edge cases.
Human coders are slow, expensive, and prone to fatigue-related errors, while rule-based NLP systems break when doctors use non-standard phrasing or new terminology. The result is a backlog of unprocessed data and inconsistent output quality.
The Solution: LLM-Powered Extraction Pipeline
I built a Python pipeline using OpenAI's GPT-4o-mini with JSON mode to ensure structured, validated output. The key insight was using a carefully engineered prompt that instructs the model to extract specific fields while adhering to medical coding standards.
The system prompt defines the exact JSON schema expected, including field types and validation rules. This ensures the output is always machine-readable and consistent across thousands of transcriptions.
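For reference, a record conforming to this schema looks like the following (the values are illustrative, not drawn from real patient data):

```json
{
  "patient_age": 67,
  "medical_specialty": "Cardiology",
  "treatment_plan": "Start beta-blocker therapy; follow-up echocardiogram in 6 weeks",
  "icd10_codes": ["I50.9", "I10"],
  "confidence_score": 0.92
}
```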
Core Extraction Function
The extraction function sends each transcription to the API along with the system prompt. JSON mode ensures the response is always parseable:
```python
import openai
import json

def extract_medical_info(transcription: str) -> dict:
    """Extract structured medical data from a transcription."""

    system_prompt = """You are a medical data extraction assistant.
Extract the following fields as valid JSON:
- patient_age: integer or null if not mentioned
- medical_specialty: string (e.g., "Cardiology")
- treatment_plan: string summary of treatment
- icd10_codes: list of applicable ICD-10 codes
- confidence_score: float between 0.0 and 1.0

Only output valid JSON. No explanations."""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcription},
        ],
        temperature=0.1,  # low temperature for consistent extraction
    )

    return json.loads(response.choices[0].message.content)
```
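JSON mode guarantees syntactically valid JSON, but not that every field matches the declared types and ranges. A minimal downstream check, sketched here with Pydantic (the `MedicalExtraction` model and `validate_extraction` helper are illustrative assumptions, not part of the original pipeline):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class MedicalExtraction(BaseModel):
    """Mirrors the schema declared in the system prompt above."""
    patient_age: Optional[int] = None
    medical_specialty: str
    treatment_plan: str
    icd10_codes: list[str]
    confidence_score: float = Field(ge=0.0, le=1.0)  # enforce the 0.0-1.0 range

def validate_extraction(raw: dict) -> Optional[MedicalExtraction]:
    """Return a validated record, or None when the model's output drifts from the schema."""
    try:
        return MedicalExtraction(**raw)
    except ValidationError as e:
        print(f"Schema violation: {e}")
        return None
```

Records that fail validation can be routed back through the extraction call rather than passed silently to analytics or billing.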
Batch Processing with Pandas
For production use, the pipeline processes entire datasets using Pandas. Each transcription is passed through the extraction function, with proper error handling and progress tracking:
```python
import pandas as pd
from tqdm import tqdm

def process_transcriptions(input_file: str, output_file: str):
    """Process a CSV of transcriptions and save extracted data."""

    df = pd.read_csv(input_file)
    print(f"Processing {len(df)} transcriptions...")

    results = []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        try:
            extracted = extract_medical_info(row['transcription'])
            extracted['source_id'] = row['id']
            results.append(extracted)
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            results.append({'source_id': row['id'], 'error': str(e)})

    output_df = pd.DataFrame(results)
    output_df.to_csv(output_file, index=False)
    print(f"Saved {len(output_df)} results to {output_file}")

# Run the pipeline
process_transcriptions('transcriptions.csv', 'extracted_data.csv')
```
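Long batch runs will eventually hit transient API rate limits. A sketch of a retry wrapper around the extraction call, using exponential backoff with jitter (`extract_with_retry` and its backoff parameters are assumptions for illustration):

```python
import random
import time

import openai

def extract_with_retry(transcription: str, max_attempts: int = 5) -> dict:
    """Call extract_medical_info, backing off exponentially on rate limits."""
    for attempt in range(max_attempts):
        try:
            return extract_medical_info(transcription)
        except openai.RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.random()
            print(f"Rate limited; retrying in {delay:.1f}s")
            time.sleep(delay)
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts")
```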
Output Validation & Flattening
After extraction, the pipeline validates and flattens the nested JSON into tabular format for analysis. Because the ICD-10 list round-trips through CSV as its string representation, it is parsed back and joined into a comma-separated string for CSV compatibility:
```python
import ast

# Validate and flatten the extracted data
df = pd.read_csv('extracted_data.csv')

# Parse the ICD-10 list back from its string repr;
# ast.literal_eval is safe, unlike eval, which would execute arbitrary code
df['icd10_codes_str'] = df['icd10_codes'].apply(
    lambda x: ', '.join(ast.literal_eval(x)) if pd.notna(x) else ''
)

# Filter high-confidence extractions
high_confidence = df[df['confidence_score'] >= 0.8]
print(f"High-confidence extractions: {len(high_confidence)}")

# Summary statistics
print("\nSpecialty distribution:")
print(df['medical_specialty'].value_counts().head(10))

print(f"\nAverage patient age: {df['patient_age'].mean():.1f}")

# Count codes per row; rows with an empty string contribute zero
total_codes = df['icd10_codes_str'].apply(
    lambda s: len(s.split(', ')) if s else 0
).sum()
print(f"Total ICD-10 codes extracted: {total_codes}")
```
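The model can also hallucinate plausible-looking code strings, so a cheap format check helps flag suspect rows before they reach billing. A sketch using a regex for the general ICD-10-CM shape (the pattern and `flag_invalid_codes` helper are illustrative assumptions; a production system would look codes up in the official code set):

```python
import re

# General ICD-10-CM shape: a letter, a digit, an alphanumeric,
# then an optional dot followed by 1-4 alphanumerics (e.g., "I50.9")
ICD10_PATTERN = re.compile(r'^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$')

def flag_invalid_codes(codes_str: str) -> list[str]:
    """Return the codes in a comma-separated string that fail the format check."""
    codes = [c.strip() for c in codes_str.split(',') if c.strip()]
    return [c for c in codes if not ICD10_PATTERN.match(c)]

df['suspect_codes'] = df['icd10_codes_str'].apply(flag_invalid_codes)
print(f"Rows with suspect codes: {(df['suspect_codes'].str.len() > 0).sum()}")
```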
Impact & Results
The pipeline reliably processes transcriptions, extracting patient demographics, specialty classifications, treatment summaries, and ICD-10 codes. JSON mode guarantees the output always parses, and downstream validation confirms it matches the expected schema.
This project demonstrates that LLMs can be reliable tools for healthcare data processing when properly constrained with structured output formats. The approach is extensible to other medical document types and can be integrated into existing healthcare IT infrastructure.