How We Built Shapy AI for Video Collaboration

TL;DR

Shapy AI is YouViCo’s intelligent assistant for video review. It auto-transcribes dialogue, summarizes scattered feedback into coherent action items, and suggests frame-specific review points before humans even comment. Built on OpenAI Whisper (transcription), GPT-4 (summarization), and a custom ONNX model (frame-level defect detection), Shapy reduces review time by 40% while catching issues human reviewers miss. This post walks through our ML pipeline, training data, and real-world performance.

The Problem: Feedback Chaos

ELBA Corp manages 140+ ad campaigns a year, and feedback on each one arrives piecemeal:

Creative director comments on minute 2
Sound engineer adds notes on minute 5
Client emails feedback on minute 3
Account manager mentions an issue verbally in a Slack call
Revisions branch: version 1 gets one round of feedback, version 2 gets different feedback

Tracking what changed, what was resolved, and what still needs work is a manual nightmare.

Shapy AI solves this by:

Understanding the content (what’s actually in the video)
Synthesizing feedback (what reviewers said, prioritized)
Suggesting improvements (what should be reviewed even if no one mentioned it)

Architecture: Three Layers

Layer 1: Transcription (OpenAI Whisper)

When a video uploads to YouViCo, we automatically run it through OpenAI Whisper, an open-source model trained on 680K hours of multilingual audio.

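from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_video(video_path: str) -> dict:
    with open(video_path, 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            language='en',
            # Timestamps are only returned with verbose_json;
            # the default response contains the text alone
            response_format='verbose_json',
            timestamp_granularities=['word', 'segment']
        )

    segments = transcript.segments or []
    return {
        'text': transcript.text,
        'segments': segments,  # Segment-level timestamps (start/end in seconds)
        'words': transcript.words,  # Word-level timestamps
        # The API returns no overall confidence score; the mean of the
        # per-segment avg_logprob is a rough proxy (closer to 0 is better)
        'confidence': (
            sum(s.avg_logprob for s in segments) / len(segments)
            if segments else None
        )
    }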

Whisper is remarkably accurate: 95%+ accuracy on clean speech, 85%+ on noisy/accented speech.

Output: Full transcript with timestamps for each word.
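
Because the later layers work at frame level, word timestamps eventually have to be mapped onto frame indices. A minimal sketch of that mapping, assuming a constant frame rate; the helper name `word_to_frame_range` and the `fps` parameter are illustrative and not taken from Shapy's codebase:

```python
import math


def word_to_frame_range(start_s: float, end_s: float, fps: float) -> tuple[int, int]:
    """Map a word's [start, end] span in seconds to inclusive frame indices."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if end_s < start_s:
        raise ValueError("end_s must not precede start_s")
    first = math.floor(start_s * fps)
    # A word that ends exactly on a frame boundary belongs to the previous frame
    last = max(first, math.ceil(end_s * fps) - 1)
    return first, last


# A word spoken from 1.0 s to 1.5 s in a 24 fps video
frames = word_to_frame_range(1.0, 1.5, fps=24)
```

Variable-frame-rate footage would need the container's per-frame timestamps instead of a single `fps` value.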

Layer 2: Feedback Synthesis (GPT-4)

After humans leave scattered comments, Shapy uses GPT-4 to synthesize them into:

Action Items: “Fix audio at 1:30, reduce saturation by 15%”
Priority Levels: Critical, High, Medium, Low
Categories: Audio, Visual, Branding, Performance

import json

from openai import OpenAI

client = OpenAI()

def synthesize_feedback(comments: list[str], transcript: str) -> dict:
    # Build the bullet list up front to keep the f-string readable
    comment_lines = '\n'.join(f'- {c}' for c in comments)
    prompt = f"""
    You are a video feedback synthesizer. Analyze these comments and transcript.

    Transcript:
    {transcript}

    Comments:
    {comment_lines}

    Output only a JSON object (no prose, no code fences) with:
    - action_items: [list of specific, actionable fixes]
    - priorities: [which items are critical vs. nice-to-have]
    - categories: [audio, visual, branding, performance, other]
    - summary: [1-sentence summary]
    """

    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.3
    )

    # Raises json.JSONDecodeError if the model strays from pure JSON
    return json.loads(response.choices[0].message.content)

This reduces 20 fragmented comments into 5 coherent action items.
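
Model replies do not always come back as clean JSON, so it helps to parse them defensively before anything downstream relies on them. The sketch below is one possible approach, not Shapy's actual parser: `parse_synthesis_reply` is a hypothetical helper, and the required keys and category names are taken from the prompt above.

```python
import json

REQUIRED_KEYS = {"action_items", "priorities", "categories", "summary"}
ALLOWED_CATEGORIES = {"audio", "visual", "branding", "performance", "other"}


def parse_synthesis_reply(reply: str) -> dict:
    """Extract the first JSON object from a model reply and sanity-check it."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value and ignores trailing text,
    # such as a closing code fence or a sign-off sentence
    data, _ = json.JSONDecoder().raw_decode(reply[start:])
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    unknown = {str(c).lower() for c in data["categories"]} - ALLOWED_CATEGORIES
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    return data


reply = (
    "Here is the synthesis:\n"
    '{"action_items": ["Fix audio at 1:30"], "priorities": ["critical"], '
    '"categories": ["Audio"], "summary": "One audio fix."}\n'
    "Let me know if you need more detail."
)
parsed = parse_synthesis_reply(reply)
```

A reply that fails these checks can be retried or routed to a human rather than shown as a summary.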

Layer 3: Defect Detection (Custom Model)

For visual issues (bad lighting, color, framing), we trained a custom ONNX model:

Input: Video frames + audio + transcript
Output: Frame-level defect scores [0-1]
- Lighting issues (0-1)
- Color grading issues (0-1)
- Audio clipping (0-1)
- Text readability (0-1)
- Motion blur (0-1)

Training data: 50,000 frames from ELBA’s video archive, manually labeled by senior editors.

Example output:

{
  "frame": 1234,
  "timestamp": "00:01:30.5",
  "defects": {
    "lighting": 0.78,
    "color_grading": 0.45,
    "audio_clipping": 0.02,
    "text_readability": 0.92
  },
  "suggested_action": "Increase key light by 20%, check color temperature"
}
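
To turn per-frame scores like these into review points, something has to decide which scores are high enough to surface. A minimal sketch, assuming a single cutoff for every defect type; the 0.7 threshold and both helper names are illustrative assumptions, not values from the production system:

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.s' timestamp into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def flag_defects(result: dict, threshold: float = 0.7) -> list[tuple[str, float]]:
    """Return (defect, score) pairs at or above the threshold, worst first."""
    flagged = [
        (name, score)
        for name, score in result["defects"].items()
        if score >= threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


example = {
    "frame": 1234,
    "timestamp": "00:01:30.5",
    "defects": {
        "lighting": 0.78,
        "color_grading": 0.45,
        "audio_clipping": 0.02,
        "text_readability": 0.92,
    },
}

flagged = flag_defects(example)  # text_readability and lighting exceed 0.7
position = timestamp_to_seconds(example["timestamp"])  # 90.5 seconds
```

In practice each defect type would probably need its own threshold, tuned against the labeled frames.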

Performance in Production

Accuracy Metrics

Task	Accuracy	Speed	Cost
Transcription (Whisper)	94% on clean speech	15 sec for 5-min video	$0.006 per audio minute (~$0.03 for a 5-min video)
Feedback Synthesis (GPT-4)	92% consistency with human summaries	45 sec for 20 comments	$0.15 per synthesis
Defect Detection (Custom)	87% precision on key issues	Real-time	Free (on-device)

Real-World Impact

Before Shapy AI:

Average review cycle: 18 days (3 rounds of feedback)
Manual synthesis time: 2 hours per project
Issues caught by reviewers: ~85%

After Shapy AI:

Average review cycle: 6 days (2 rounds)
Synthesis time: 30 minutes (AI + human review)
Issues caught: ~92% (7 percentage points more than human reviewers alone)

In short: review cycles are 67% shorter, and 7 percentage points more issues are caught.
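
For readers who want to check the arithmetic, the figures follow directly from the before/after numbers above. Note that the gain in caught issues is a difference in percentage points (85% to 92%); expressed relative to the baseline it is a little over 8%. The function name is illustrative:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


cycle_reduction = percent_reduction(18, 6)        # review cycle, in days
synthesis_reduction = percent_reduction(120, 30)  # synthesis time, in minutes
catch_gain_points = 92 - 85                       # percentage points
catch_gain_relative = (92 - 85) / 85 * 100        # relative to the 85% baseline
```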

Handling Edge Cases

Problem 1: Accent & Dialect Accuracy

Whisper was trained primarily on English: roughly two-thirds of its 680K hours of training audio are English speech. Accuracy drops for non-native speakers.

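Solution: Fine-tune Whisper on client vocabulary. The hosted whisper-1 endpoint cannot be fine-tuned through OpenAI’s fine-tuning API, so this has to run on the open-source Whisper weights. A minimal sketch using Hugging Face Transformers; the checkpoint and learning rate are illustrative assumptions:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

CHECKPOINT = 'openai/whisper-small'  # assumption: any open Whisper checkpoint
processor = WhisperProcessor.from_pretrained(CHECKPOINT, language='english', task='transcribe')
model = WhisperForConditionalGeneration.from_pretrained(CHECKPOINT)

def finetune_whisper_for_domain(training_data: list[dict], epochs: int = 1):
    """
    training_data = [
        {
            'audio': waveform,  # 1-D float array, 16 kHz mono
            'transcript': 'YouViCo platform enables real-time collaboration'
        }
    ]
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for _ in range(epochs):
        for example in training_data:
            features = processor.feature_extractor(
                example['audio'], sampling_rate=16000, return_tensors='pt'
            ).input_features
            labels = processor.tokenizer(example['transcript'], return_tensors='pt').input_ids
            # The model prepends the start token itself, so drop it from the labels
            if labels[0, 0].item() == model.config.decoder_start_token_id:
                labels = labels[:, 1:]
            # Passing labels makes the forward pass return the training loss
            loss = model(input_features=features, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    return model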

For YouViCo’s user base, domain-specific fine-tuning improved accuracy from 87% to 94%.

Problem 2: Hallucinations in Synthesis

GPT-4 sometimes invents feedback that no reviewer gave. Example: “Reduce dialogue volume” when the actual comment was “Audio sounds good, but check the background music.”

Solution: Fact-check against original comments.

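def validate_synthesis(original_comments: list[str], synthesis: dict) -> dict:
    # similarity_score(a, b) is assumed to return a value in [0, 1]
    validated = []
    for action_item in synthesis['action_items']:
        # The model may return plain strings; wrap them so they can be annotated
        item = dict(action_item) if isinstance(action_item, dict) else {'text': action_item}

        # Check if any original comment matches this action
        matches = [
            c for c in original_comments
            if similarity_score(item.get('text', ''), c) > 0.7
        ]

        if not matches:
            # Action item is unsupported by comments, mark as uncertain
            item['uncertain'] = True
            item['confidence'] = 0.5
        validated.append(item)

    synthesis['action_items'] = validated
    return synthesis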

Fallback: always show original comments alongside AI summary.
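
The `similarity_score` helper used in `validate_synthesis` is not shown in this post. As a stand-in, here is a minimal lexical version based on word overlap (Jaccard similarity). This is an assumption about one way to implement it, not the production scorer; an embedding-based cosine similarity would be more tolerant of paraphrases.

```python
import re


def similarity_score(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two strings, in [0, 1]."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# The hallucinated item from the example shares no words with the real comment
hallucinated = similarity_score(
    "Reduce dialogue volume",
    "Audio sounds good, but check the background music.",
)
```

With a purely lexical score, the 0.7 threshold is strict: a legitimately rephrased action item can also fall below it, which is one more reason to mark such items as uncertain rather than drop them.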

Problem 3: Expensive API Calls

Each video synthesis costs $0.15 in GPT-4 tokens. At 1,000 videos/month, that’s $150; at 10,000 videos/month, it’s $1,500.

Solution: Caching + cheaper models.

Cache feedback synthesis for identical comment sets
Use smaller models (GPT-3.5) for routine tasks
Only use GPT-4 for complex synthesis

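import hashlib
import json

def get_synthesis(comments: list[str], use_cache=True):
    # Built-in hash() is salted per process for strings, so its keys would not
    # survive a restart; a SHA-256 digest is stable. sorted(set(...)) keeps the
    # key independent of comment order and duplicates.
    canonical = json.dumps(sorted(set(comments)))
    comment_hash = hashlib.sha256(canonical.encode('utf-8')).hexdigest()

    # Check cache (cache: any store exposing get(key) and set(key, value, ttl=...))
    if use_cache:
        cached = cache.get(comment_hash)
        if cached is not None:
            return cached

    # For simple cases (< 5 comments), use cheaper GPT-3.5
    model = 'gpt-3.5-turbo' if len(comments) < 5 else 'gpt-4'  # $0.01 instead of $0.15
    response = client.chat.completions.create(
        model=model,
        # build_messages is a hypothetical helper (not shown) that wraps the
        # same prompt used in synthesize_feedback
        messages=build_messages(comments),
        temperature=0.3
    )

    synthesis = json.loads(response.choices[0].message.content)

    # Cache for next time (30 days)
    if use_cache:
        cache.set(comment_hash, synthesis, ttl=30*24*3600)

    return synthesis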
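
The `cache` object above only needs `get(key)` and `set(key, value, ttl=...)`. For local testing, an in-memory stand-in is enough. The class below is a sketch under that assumption; `TTLCache` is a hypothetical name, and a shared store such as Redis would be the more likely choice in production.

```python
import time
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """In-memory key-value store whose entries expire after `ttl` seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def set(self, key: Hashable, value: Any, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: evict lazily on read
            del self._store[key]
            return None
        return value


cache = TTLCache()
cache.set("abc123", {"summary": "One audio fix."}, ttl=30 * 24 * 3600)
hit = cache.get("abc123")
```

Entries are only evicted when read, so a long-running process with many distinct keys would also need periodic cleanup or a size bound.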

Result: API costs reduced from $0.15 to $0.04 per video.
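
To see how caching and model selection combine into a lower per-video cost, the expected cost can be written as a small formula. The sketch below uses the per-call prices quoted in this post ($0.01 and $0.15); the cache hit rate and the share of small jobs are made-up inputs chosen only to show the arithmetic, not measured values.

```python
def blended_cost_cents(
    cache_hit_rate: float,
    cheap_share: float,
    cheap_cents: float = 1.0,   # GPT-3.5 call, from the figure quoted above
    full_cents: float = 15.0,   # GPT-4 call, from the figure quoted above
) -> float:
    """Expected cost per video in cents; cache hits are treated as free."""
    per_call = cheap_share * cheap_cents + (1 - cheap_share) * full_cents
    return (1 - cache_hit_rate) * per_call


def monthly_cost_usd(videos: int, cents_per_video: float) -> float:
    """Monthly API spend in dollars for a given volume."""
    return videos * cents_per_video / 100


# Illustrative mix only: half the requests cached, half the rest small jobs
example_cost = blended_cost_cents(cache_hit_rate=0.5, cheap_share=0.5)

before = monthly_cost_usd(10_000, 15)  # at $0.15 per video
after = monthly_cost_usd(10_000, 4)    # at $0.04 per video
```

Many different mixes produce the same blended figure, so the formula is more useful for planning than for reverse-engineering the actual hit rate.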

User Feedback

Teams using Shapy AI report:

“Saves 1-2 hours per project” (creative teams)
“Catches issues I’d miss” (senior reviewers)
“Transcript is useful on its own” (when dialogue needs exact quotes)
“AI suggestions sometimes feel forced” (constructive criticism)

We’re iterating on the UX to make it easier to accept or ignore suggestions.

Cost-Benefit Analysis

Cost	Benefit
Whisper API: $0.006/min of audio	Auto-transcription: saves 10 min/video
GPT-4 synthesis: $0.04/video	Action items summary: saves 2 hours/project
Infrastructure: $50/month	Defect detection: catches 7% more issues
Total: ~$0.05/video	ROI: 40x in time savings per video

At scale, Shapy costs less than 1% of value generated.

Lessons Learned

Task decomposition: Transcription, Summarization, Detection are separate problems. Solve each independently.
Hybrid AI: Combine best-of-breed models (Whisper for audio, GPT-4 for NLP, a custom ONNX model for vision).
Human review is essential: Always show AI outputs alongside human input. Never auto-apply suggestions.
Edge cases matter: 5% of videos are non-English or heavily accented. Invest in handling them.
Cost control is critical: At $0.15/video, Shapy was unscalable. Optimization (caching, model selection) was necessary.

Shapy AI represents the future of video collaboration: not replacing human judgment, but augmenting it with intelligent assistance.