TL;DR
Shapy AI is YouViCo’s intelligent assistant for video review. It auto-transcribes dialogue, summarizes scattered feedback into coherent action items, and suggests frame-specific review points before humans even comment. Built on OpenAI Whisper (transcription), GPT-4 (summarization), and a custom ONNX model (frame-level defect detection), Shapy reduces review time by 40% while catching issues human reviewers miss. This post walks through our ML pipeline, training data, and real-world performance.
The Problem: Feedback Chaos
When ELBA Corp manages 140+ ad campaigns annually, feedback arrives in chaos:
- Creative director comments on minute 2
- Sound engineer adds notes on minute 5
- Client emails feedback on minute 3
- Account manager mentions an issue verbally in a Slack call
- Revisions branch: version 1 gets one round of feedback, version 2 gets different feedback
Tracking what changed, what was resolved, and what still needs work is a manual nightmare.
Shapy AI solves this by:
- Understanding the content (what’s actually in the video)
- Synthesizing feedback (what reviewers said, prioritized)
- Suggesting improvements (what should be reviewed even if no one mentioned it)
Architecture: Three Layers
Layer 1: Transcription (OpenAI Whisper)
When a video uploads to YouViCo, we automatically run it through OpenAI Whisper, an open-source model trained on 680K hours of multilingual audio.
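from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_video(video_path: str) -> dict:
    with open(video_path, 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            language='en',
            # Timestamps are only returned with the verbose JSON format
            response_format='verbose_json',
            timestamp_granularities=['word', 'segment'],
        )
    return {
        'text': transcript.text,
        'segments': transcript.segments,  # Segment-level timestamps
        'words': transcript.words,  # Word-level timestamps
        # The API returns no overall confidence score; each segment
        # carries an avg_logprob that can serve as a rough proxy
    }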
Whisper is remarkably accurate: 95%+ accuracy on clean speech, 85%+ on noisy/accented speech.
Output: Full transcript with timestamps for each word.
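To tie transcript text back to review points, timestamps in seconds need to become timecodes that reviewers can read. The sketch below is a minimal illustration, not the production code: it assumes segments have already been converted to plain dicts with `start` and `text` keys, and the helper names `to_timecode` and `index_segments` are hypothetical.

```python
def to_timecode(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.t (tenths of a second)."""
    tenths = round(seconds * 10)  # work in integer tenths to avoid float drift
    hours, rem = divmod(tenths, 36000)
    minutes, rem = divmod(rem, 600)
    secs, tenth = divmod(rem, 10)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{tenth}"


def index_segments(segments: list[dict]) -> list[dict]:
    """Attach a readable timecode to each transcript segment."""
    return [
        {"timecode": to_timecode(seg["start"]), "text": seg["text"].strip()}
        for seg in segments
    ]


# Hypothetical segments, assumed to be dicts with start/end/text keys
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the spring campaign."},
    {"start": 90.5, "end": 94.0, "text": " Our new lineup launches today."},
]
for row in index_segments(segments):
    print(row["timecode"], row["text"])
# 00:00:00.0 Welcome to the spring campaign.
# 00:01:30.5 Our new lineup launches today.
```

Working in integer tenths keeps the formatting exact for values like 90.5 and avoids rounding surprises from repeated float division.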
Layer 2: Feedback Synthesis (GPT-4)
After humans leave scattered comments, Shapy uses GPT-4 to synthesize them into:
- Action Items: “Fix audio at 1:30, reduce saturation by 15%”
- Priority Levels: Critical, High, Medium, Low
- Categories: Audio, Visual, Branding, Performance
import json

from openai import OpenAI

client = OpenAI()

def synthesize_feedback(comments: list[str], transcript: str) -> dict:
    # Render the comments as a bulleted list for the prompt
    comment_list = '\n'.join(f'- {c}' for c in comments)
    prompt = f"""
You are a video feedback synthesizer. Analyze these comments and transcript.

Transcript:
{transcript}

Comments:
{comment_list}

Output JSON with:
- action_items: [list of specific, actionable fixes]
- priorities: [which items are critical vs. nice-to-have]
- categories: [audio, visual, branding, performance, other]
- summary: [1-sentence summary]
"""
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.3
    )
    # Raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(response.choices[0].message.content)
This reduces 20 fragmented comments into 5 coherent action items.
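Chat models do not always return bare JSON; a reply can include a sentence of preamble or omit a requested key. The sketch below shows one defensive way to parse such a reply. It is an illustration under assumptions: `parse_synthesis` and `REQUIRED_KEYS` are hypothetical names, and the key set simply mirrors the four fields requested in the prompt.

```python
import json
import re

# Keys requested in the synthesis prompt
REQUIRED_KEYS = {"action_items", "priorities", "categories", "summary"}


def parse_synthesis(raw: str) -> dict:
    """Extract and validate the JSON object from a model reply."""
    # Take the outermost {...} span so surrounding prose is ignored
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"model reply is missing keys: {sorted(missing)}")
    return data


# Hypothetical reply with a line of prose before the JSON
reply = (
    "Here is the synthesis:\n"
    '{"action_items": ["Fix audio at 1:30"], "priorities": ["Critical"], '
    '"categories": ["audio"], "summary": "One audio fix is needed."}'
)
print(parse_synthesis(reply)["action_items"])
# ['Fix audio at 1:30']
```

Failing loudly on a missing key is a deliberate choice here: a partial synthesis is easier to retry than to repair downstream.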
Layer 3: Defect Detection (Custom Model)
For visual issues (bad lighting, color, framing), we trained a custom ONNX model:
Input: Video frames + audio + transcript
Output: Frame-level defect scores [0-1]
- Lighting issues (0-1)
- Color grading issues (0-1)
- Audio clipping (0-1)
- Text readability (0-1)
- Motion blur (0-1)
Training data: 50,000 frames from ELBA’s video archive, manually labeled by senior editors.
Example output:
{
"frame": 1234,
"timestamp": "00:01:30.5",
"defects": {
"lighting": 0.78,
"color_grading": 0.45,
"audio_clipping": 0.02,
"text_readability": 0.92
},
"suggested_action": "Increase key light by 20%, check color temperature"
}
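Per-frame scores become review points once they are compared against a cutoff. The sketch below assumes that a higher score means a more severe defect and uses a 0.5 threshold; both are assumptions for illustration, and `flag_defects` is a hypothetical helper rather than part of the model.

```python
def flag_defects(result: dict, threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (defect, score) pairs at or above the threshold, worst first."""
    flagged = [
        (name, score)
        for name, score in result["defects"].items()
        if score >= threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# The example output from above, minus the suggested action
result = {
    "frame": 1234,
    "timestamp": "00:01:30.5",
    "defects": {
        "lighting": 0.78,
        "color_grading": 0.45,
        "audio_clipping": 0.02,
        "text_readability": 0.92,
    },
}
print(flag_defects(result))
# [('text_readability', 0.92), ('lighting', 0.78)]
```

In practice the threshold would likely differ per defect type, since a 0.5 lighting score and a 0.5 audio clipping score need not be equally serious.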
Performance in Production
Accuracy Metrics
| Task | Accuracy | Speed | Cost |
|---|---|---|---|
| Transcription (Whisper) | 94% on clean speech | 15 sec for 5-min video | $0.006 per minute of audio |
| Feedback Synthesis (GPT-4) | 92% consistency with human summaries | 45 sec for 20 comments | $0.15 per synthesis |
| Defect Detection (Custom) | 87% precision on key issues | Real-time | Free (on-device) |
Real-World Impact
Before Shapy AI:
- Average review cycle: 18 days (3 rounds of feedback)
- Manual synthesis time: 2 hours per project
- Issues caught by reviewers: ~85%
After Shapy AI:
- Average review cycle: 6 days (2 rounds)
- Synthesis time: 30 minutes (AI + human review)
- Issues caught: ~92% (AI adds 7 percentage points over human reviewers alone)
Translation: 67% shorter cycles and 7 percentage points more issues caught.
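The headline figures follow directly from the before and after numbers. The short calculation below reproduces them; `percent_reduction` is a hypothetical helper, and the only inputs are the values listed above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


cycle = percent_reduction(18, 6)        # review cycle in days
synthesis = percent_reduction(120, 30)  # synthesis time in minutes
catch_gain = 92 - 85                    # percentage points, not a relative change

print(f"cycle: {cycle:.0f}% shorter")          # cycle: 67% shorter
print(f"synthesis: {synthesis:.0f}% shorter")  # synthesis: 75% shorter
print(f"catch rate: +{catch_gain} points")     # catch rate: +7 points
```

The catch-rate change is a difference between two percentages, so it is reported in points rather than as a relative increase.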
Handling Edge Cases
Problem 1: Accent & Dialect Accuracy
Whisper was trained primarily on English: roughly two-thirds of its 680K training hours are English audio. Accuracy drops for non-native speakers and heavy accents.
Solution: Fine-tune Whisper on client vocabulary.
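import json

def prepare_finetune_manifest(training_data: list[dict],
                              manifest_path: str = 'whisper_finetune.jsonl') -> str:
    """
    training_data = [
        {
            'audio': 'clips/intro.wav',  # path to an audio clip
            'transcript': 'YouViCo platform enables real-time collaboration'
        }
    ]
    """
    # OpenAI's fine-tuning API does not accept 'whisper-1', so the fine-tune
    # has to run on the open-source Whisper weights in your own training stack.
    # This step only writes the audio/transcript manifest that job consumes.
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for example in training_data:
            f.write(json.dumps(example) + '\n')
    return manifest_path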
For YouViCo’s user base, domain-specific fine-tuning improved accuracy from 87% to 94%.
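Accuracy figures like these are usually derived from word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the reference length. The post does not state which metric was used, so treating accuracy as 1 minus WER is an assumption. A minimal, dependency-free sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Row-by-row dynamic programming over the edit-distance table
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "YouViCo platform enables real-time collaboration"
hypothesis = "you vico platform enables real-time collaboration"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
# WER: 0.40, word accuracy: 0.60
```

The example shows why brand names matter: a single split product name costs two edits in a five-word sentence.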
Problem 2: Hallucinations in Synthesis
GPT-4 sometimes invents feedback that no one said. Example: “Reduce dialogue volume” when the actual comment was “Audio sounds good, but check the background music.”
Solution: Fact-check against original comments.
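def validate_synthesis(original_comments: list[str], synthesis: dict) -> dict:
    validated = []
    for action_item in synthesis['action_items']:
        # The model may return action items as plain strings; wrap them
        # in a dict (assumed 'text' key) so the flags below can be attached
        if isinstance(action_item, str):
            action_item = {'text': action_item}
        # Check if any original comment matches this action.
        # similarity_score is a helper (not shown) returning a value in [0, 1]
        matches = [
            c for c in original_comments
            if similarity_score(action_item['text'], c) > 0.7
        ]
        if not matches:
            # Action item is unsupported by comments, mark as uncertain
            action_item['uncertain'] = True
            action_item['confidence'] = 0.5
        validated.append(action_item)
    synthesis['action_items'] = validated
    return synthesis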
Fallback: always show original comments alongside AI summary.
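The similarity measure used for this check is not shown in the post. As a rough stand-in, the sketch below scores an action item by the fraction of its words that appear in a comment (token containment). This is an assumption for illustration only; an embedding-based similarity would likely handle paraphrases better, and all function names here are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def containment(action: str, comment: str) -> float:
    """Fraction of the action item's tokens that also appear in the comment."""
    action_tokens = tokens(action)
    if not action_tokens:
        return 0.0
    return len(action_tokens & tokens(comment)) / len(action_tokens)


def unsupported_items(actions: list[str], comments: list[str],
                      threshold: float = 0.7) -> list[str]:
    """Action items with no comment scoring above the threshold."""
    return [
        a for a in actions
        if not any(containment(a, c) > threshold for c in comments)
    ]


comments = ["Audio sounds good, but check the background music."]
actions = ["Check the background music", "Reduce dialogue volume"]
print(unsupported_items(actions, comments))
# ['Reduce dialogue volume']
```

On the example from this section, the invented "Reduce dialogue volume" item shares no words with the real comment and is flagged, while the grounded item passes.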
Problem 3: Expensive API Calls
Each video synthesis costs $0.15 in GPT-4 tokens. At 1,000 videos/month, that’s $150. Scaling to 10,000 = $1,500/month.
Solution: Caching + local models.
- Cache feedback synthesis for identical comment sets
- Use smaller models (GPT-3.5) for routine tasks
- Only use GPT-4 for complex synthesis
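import hashlib
import json

# Assumes module-level `client` (OpenAI) and `cache` (any key-value client
# exposing get/set with a TTL)
def get_synthesis(comments: list[str], use_cache=True):
    # Built-in hash() is salted per process, so derive a stable cache key
    # from the sorted, de-duplicated comment set instead
    key_source = json.dumps(sorted(set(comments)))
    comment_hash = hashlib.sha256(key_source.encode('utf-8')).hexdigest()

    # Check cache
    if use_cache:
        cached = cache.get(comment_hash)
        if cached:
            return cached

    # For simple cases (< 5 comments), use cheaper GPT-3.5 ($0.01 instead of $0.15)
    model = 'gpt-3.5-turbo' if len(comments) < 5 else 'gpt-4'
    response = client.chat.completions.create(
        model=model,
        messages=...,  # prompt built as in synthesize_feedback
    )
    synthesis = json.loads(response.choices[0].message.content)

    # Cache for next time (30 days)
    if use_cache:
        cache.set(comment_hash, synthesis, ttl=30*24*3600)
    return synthesis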
Result: API costs reduced from $0.15 to $0.04 per video.
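How caching and model selection combine into a lower average can be worked out with a small expected-cost calculation. The cache hit rate and the share of small requests below are hypothetical values chosen to illustrate one mix that lands on $0.04; the post does not report the actual rates. `blended_cost` is a hypothetical helper.

```python
def blended_cost(cache_hit_rate: float, cheap_share: float,
                 cheap_cost: float = 0.01, full_cost: float = 0.15) -> float:
    """Expected API cost per synthesis, treating cache hits as free."""
    miss_rate = 1 - cache_hit_rate
    per_miss = cheap_share * cheap_cost + (1 - cheap_share) * full_cost
    return miss_rate * per_miss


# Hypothetical mix: half of requests hit the cache, and half of the
# remaining requests are small enough for the cheaper model
cost = blended_cost(cache_hit_rate=0.5, cheap_share=0.5)
print(f"${cost:.2f} per synthesis")                          # $0.04 per synthesis
print(f"${cost * 10_000:,.0f} per month at 10,000 videos")   # $400 per month at 10,000 videos
```

With no cache hits and no cheap-model routing, the same function returns the original $0.15, which makes it easy to see which lever contributes more for a given workload.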
User Feedback
Teams using Shapy AI report:
- “Saves 1-2 hours per project” (creative teams)
- “Catches issues I’d miss” (senior reviewers)
- “Transcript is useful on its own” (when dialogue needs exact quotes)
- “AI suggestions sometimes feel forced” (constructive criticism)
We’re iterating on the UX to make accepting or ignoring suggestions easier.
Cost-Benefit Analysis
| Cost | Benefit |
|---|---|
| Whisper API: $0.006 per minute of audio | Auto-transcription: saves 10 min/video |
| GPT-4 synthesis: $0.04/video | Action items summary: saves 2 hours/project |
| Infrastructure: $50/month | Defect detection: catches 7 percentage points more issues |
| Total: ~$0.05/video | ROI: 40x in time savings per video |
At scale, Shapy costs less than 1% of value generated.
Lessons Learned
- Task decomposition: Transcription, Summarization, Detection are separate problems. Solve each independently.
- Hybrid AI: Combine best-of-breed models (Whisper for audio, GPT-4 for NLP, ONNX for vision).
- Human review is essential: Always show AI outputs alongside human input. Never auto-apply suggestions.
- Edge cases matter: 5% of videos are non-English or heavily accented. Invest in handling them.
- Cost control is critical: At $0.15/video, Shapy was unscalable. Optimization (caching, model selection) was necessary.
Shapy AI represents the future of video collaboration: not replacing human judgment, but augmenting it with intelligent assistance.