Roles & Stack
Impact
- Automated extraction of 4+ clinical data fields per transcription
- Structured output with ICD-10 code matching
- JSON schema validation for data integrity
- Batch processing capability for large datasets
Year
2025
Category
AI & Machine Learning
Deliverables
- Python Data Extraction Pipeline
- OpenAI API Integration
- Structured JSON Output Schema
- Jupyter Notebook Documentation
The Situation: Unstructured Clinical Data
Healthcare organizations generate massive volumes of clinical transcriptions daily. These free-form text documents contain valuable patient information—diagnoses, treatments, demographics—but extracting this data manually is expensive, slow, and error-prone.
I was tasked with building an automated pipeline that could parse raw transcriptions and output structured, validated JSON suitable for analytics and billing systems. The key challenge: medical language is nuanced, full of abbreviations, and requires domain knowledge to interpret correctly.
The Problem: Manual Extraction Doesn't Scale
Traditional approaches involve either human coders who manually review each transcription or rigid rule-based systems that fail on edge cases.
Human coders are slow, expensive, and prone to fatigue-related errors, while rule-based NLP systems break when doctors use non-standard phrasing or new terminology. The result is a backlog of unprocessed data and inconsistent output quality.
The Solution: LLM-Powered Extraction Pipeline
I built a Python pipeline using OpenAI's GPT-4o-mini with JSON mode to ensure structured, validated output. The key insight was using a carefully engineered prompt that instructs the model to extract specific fields while adhering to medical coding standards.
The system prompt defines the exact JSON schema expected, including field types and validation rules. This ensures the output is always machine-readable and consistent across thousands of transcriptions.
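For reference, a record conforming to this schema looks like the following (the values are illustrative, not drawn from real patient data):

```json
{
  "patient_age": 67,
  "medical_specialty": "Cardiology",
  "treatment_plan": "Start beta-blocker therapy; follow-up echocardiogram in 6 weeks",
  "icd10_codes": ["I50.9", "I10"],
  "confidence_score": 0.92
}
```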
Core Extraction Function
The extraction function sends each transcription to the API along with the system prompt. JSON mode ensures the response is always parseable:
```python
import openai
import json

def extract_medical_info(transcription: str) -> dict:
    """Extract structured medical data from a transcription."""

    system_prompt = """You are a medical data extraction assistant.
Extract the following fields as valid JSON:
- patient_age: integer or null if not mentioned
- medical_specialty: string (e.g., "Cardiology")
- treatment_plan: string summary of treatment
- icd10_codes: list of applicable ICD-10 codes
- confidence_score: float between 0.0 and 1.0

Only output valid JSON. No explanations."""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcription},
        ],
        temperature=0.1,  # low temperature for consistent extraction
    )

    return json.loads(response.choices[0].message.content)
```
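JSON mode guarantees syntactically valid JSON, but not that every field matches the declared types and ranges. A minimal downstream check, sketched here with Pydantic (the `MedicalExtraction` model and `validate_extraction` helper are illustrative assumptions, not part of the original pipeline):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class MedicalExtraction(BaseModel):
    """Mirrors the schema declared in the system prompt above."""
    patient_age: Optional[int] = None
    medical_specialty: str
    treatment_plan: str
    icd10_codes: list[str]
    confidence_score: float = Field(ge=0.0, le=1.0)  # enforce the 0.0-1.0 range

def validate_extraction(raw: dict) -> Optional[MedicalExtraction]:
    """Return a validated record, or None when the model's output drifts from the schema."""
    try:
        return MedicalExtraction(**raw)
    except ValidationError as e:
        print(f"Schema violation: {e}")
        return None
```

Records that fail validation can be routed back through the extraction call rather than passed silently to analytics or billing.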
Batch Processing with Pandas
For production use, the pipeline processes entire datasets using Pandas. Each transcription is passed through the extraction function, with proper error handling and progress tracking:
```python
import pandas as pd
from tqdm import tqdm

def process_transcriptions(input_file: str, output_file: str):
    """Process a CSV of transcriptions and save extracted data."""

    df = pd.read_csv(input_file)
    print(f"Processing {len(df)} transcriptions...")

    results = []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        try:
            extracted = extract_medical_info(row['transcription'])
            extracted['source_id'] = row['id']
            results.append(extracted)
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            results.append({'source_id': row['id'], 'error': str(e)})

    output_df = pd.DataFrame(results)
    output_df.to_csv(output_file, index=False)
    print(f"Saved {len(output_df)} results to {output_file}")

# Run the pipeline
process_transcriptions('transcriptions.csv', 'extracted_data.csv')
```
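Long batch runs will eventually hit transient API rate limits. A sketch of a retry wrapper around the extraction call, using exponential backoff with jitter (`extract_with_retry` and its backoff parameters are assumptions for illustration):

```python
import random
import time

import openai

def extract_with_retry(transcription: str, max_attempts: int = 5) -> dict:
    """Call extract_medical_info, backing off exponentially on rate limits."""
    for attempt in range(max_attempts):
        try:
            return extract_medical_info(transcription)
        except openai.RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.random()
            print(f"Rate limited; retrying in {delay:.1f}s")
            time.sleep(delay)
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts")
```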
Output Validation & Flattening
After extraction, the pipeline validates and flattens the nested JSON into tabular format for analysis. Because the ICD-10 list round-trips through CSV as its string representation, it is parsed back and joined into a comma-separated string for CSV compatibility:
```python
import ast

# Validate and flatten the extracted data
df = pd.read_csv('extracted_data.csv')

# Parse the ICD-10 list back from its string repr;
# ast.literal_eval is safe, unlike eval, which would execute arbitrary code
df['icd10_codes_str'] = df['icd10_codes'].apply(
    lambda x: ', '.join(ast.literal_eval(x)) if pd.notna(x) else ''
)

# Filter high-confidence extractions
high_confidence = df[df['confidence_score'] >= 0.8]
print(f"High-confidence extractions: {len(high_confidence)}")

# Summary statistics
print("\nSpecialty distribution:")
print(df['medical_specialty'].value_counts().head(10))

print(f"\nAverage patient age: {df['patient_age'].mean():.1f}")

# Count codes per row; rows with an empty string contribute zero
total_codes = df['icd10_codes_str'].apply(
    lambda s: len(s.split(', ')) if s else 0
).sum()
print(f"Total ICD-10 codes extracted: {total_codes}")
```
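The model can also hallucinate plausible-looking code strings, so a cheap format check helps flag suspect rows before they reach billing. A sketch using a regex for the general ICD-10-CM shape (the pattern and `flag_invalid_codes` helper are illustrative assumptions; a production system would look codes up in the official code set):

```python
import re

# General ICD-10-CM shape: a letter, a digit, an alphanumeric,
# then an optional dot followed by 1-4 alphanumerics (e.g., "I50.9")
ICD10_PATTERN = re.compile(r'^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$')

def flag_invalid_codes(codes_str: str) -> list[str]:
    """Return the codes in a comma-separated string that fail the format check."""
    codes = [c.strip() for c in codes_str.split(',') if c.strip()]
    return [c for c in codes if not ICD10_PATTERN.match(c)]

df['suspect_codes'] = df['icd10_codes_str'].apply(flag_invalid_codes)
print(f"Rows with suspect codes: {(df['suspect_codes'].str.len() > 0).sum()}")
```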
Impact & Results
The pipeline reliably processes transcriptions, extracting patient demographics, specialty classifications, treatment summaries, and ICD-10 codes. JSON mode guarantees the output always parses, and downstream validation confirms it matches the expected schema.
This project demonstrates that LLMs can be reliable tools for healthcare data processing when properly constrained with structured output formats. The approach is extensible to other medical document types and can be integrated into existing healthcare IT infrastructure.