How We Built Shapy AI for Video Collaboration

TL;DR

Shapy AI is YouViCo’s intelligent assistant for video review. It auto-transcribes dialogue, summarizes scattered feedback into coherent action items, and suggests frame-specific review points before humans even comment. Built on OpenAI Whisper (transcription), GPT-4 (summarization), and a custom ONNX model (frame-level defect detection), Shapy reduces review time by 40% while catching issues human reviewers miss. This post walks through our ML pipeline, training data, and real-world performance.

The Problem: Feedback Chaos

When ELBA Corp manages 140+ ad campaigns annually, feedback arrives in chaos.

Tracking what changed, what was resolved, and what still needs work is a manual nightmare.

Shapy AI solves this by:

  1. Understanding the content (what’s actually in the video)
  2. Synthesizing feedback (what reviewers said, prioritized)
  3. Suggesting improvements (what should be reviewed even if no one mentioned it)

Architecture: Three Layers

Layer 1: Transcription (OpenAI Whisper)

When a video is uploaded to YouViCo, we automatically run it through OpenAI Whisper, an open-source speech recognition model trained on 680K hours of multilingual audio.

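import math

from openai import OpenAI

client = OpenAI()

def transcribe_video(video_path: str) -> dict:
    with open(video_path, 'rb') as f:
        transcript = client.audio.transcriptions.create(
            model='whisper-1',
            file=f,
            language='en',
            response_format='verbose_json',  # plain 'json' returns text only
            timestamp_granularities=['segment', 'word'],
        )

    segments = transcript.segments or []
    # The API returns no overall confidence score; approximate one from the
    # per-segment average log-probabilities.
    confidence = (
        sum(math.exp(s.avg_logprob) for s in segments) / len(segments)
        if segments else None
    )

    return {
        'text': transcript.text,
        'segments': segments,       # Segment-level timestamps
        'words': transcript.words,  # Word-level timestamps
        'confidence': confidence,
    }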

Whisper is remarkably accurate: 95%+ accuracy on clean speech, 85%+ on noisy/accented speech.

Output: Full transcript with timestamps for each word.
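
Review comments attach to frames, while Whisper reports times in seconds, so word timestamps have to be mapped onto frame numbers somewhere. Here is a minimal sketch of that mapping, assuming a constant frame rate. The `Word` type, the function names, and the frame rates are illustrative assumptions, not YouViCo's actual code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def seconds_to_frame(t: float, fps: float) -> int:
    """Return the zero-based index of the frame on screen at time t."""
    # The small epsilon guards against float error just below a frame boundary.
    return math.floor(t * fps + 1e-9)


def words_at_frame(words: list[Word], frame: int, fps: float) -> list[str]:
    """Return the words being spoken while the given frame is displayed."""
    return [
        w.text
        for w in words
        if seconds_to_frame(w.start, fps) <= frame <= seconds_to_frame(w.end, fps)
    ]
```

Variable-frame-rate footage would need the container's real frame timestamps instead of a single `fps` value.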

Layer 2: Feedback Synthesis (GPT-4)

After humans leave scattered comments, Shapy uses GPT-4 to synthesize them into:

  1. Action Items: “Fix audio at 1:30, reduce saturation by 15%”
  2. Priority Levels: Critical, High, Medium, Low
  3. Categories: Audio, Visual, Branding, Performance

import json

from openai import OpenAI

client = OpenAI()

def synthesize_feedback(comments: list[str], transcript: str) -> dict:
    # Build the bullet list outside the f-string so this runs on any Python 3 version
    comment_lines = '\n'.join(f'- {c}' for c in comments)

    prompt = f"""
    You are a video feedback synthesizer. Analyze these comments and transcript.
    
    Transcript:
    {transcript}
    
    Comments:
    {comment_lines}
    
    Output JSON with:
    - action_items: [list of specific, actionable fixes]
    - priorities: [which items are critical vs. nice-to-have]
    - categories: [audio, visual, branding, performance, other]
    - summary: [1-sentence summary]
    """
    
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.3  # low temperature keeps summaries consistent between runs
    )
    
    # Raises json.JSONDecodeError if the model wraps the JSON in extra text
    return json.loads(response.choices[0].message.content)

This reduces 20 fragmented comments into 5 coherent action items.
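
Calling `json.loads` directly on the model output assumes the reply is bare JSON with every requested field present. Models sometimes wrap the object in a Markdown code fence or drop a key. Here is a hedged sketch of a more defensive parser. `parse_synthesis` and `REQUIRED_KEYS` are hypothetical names, and the key list is taken from the prompt above.

```python
import json

# Keys the prompt asks the model to return.
REQUIRED_KEYS = {"action_items", "priorities", "categories", "summary"}

# Built programmatically so this example can itself sit inside a code fence.
FENCE = "`" * 3


def parse_synthesis(raw: str) -> dict:
    """Parse the model reply, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence line
        text = "\n".join(lines)

    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("synthesis must be a JSON object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"synthesis is missing keys: {sorted(missing)}")
    return data
```

Failing loudly here lets the caller retry or fall back to showing the raw comments, so a half-formed summary never reaches reviewers.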

Layer 3: Defect Detection (Custom Model)

For visual issues (bad lighting, color, framing), we trained a custom ONNX model:

Input: Video frames + audio + transcript
Output: Frame-level defect scores [0-1]
- Lighting issues (0-1)
- Color grading issues (0-1)
- Audio clipping (0-1)
- Text readability (0-1)
- Motion blur (0-1)

Training data: 50,000 frames from ELBA’s video archive, manually labeled by senior editors.

Example output:

{
  "frame": 1234,
  "timestamp": "00:01:30.5",
  "defects": {
    "lighting": 0.78,
    "color_grading": 0.45,
    "audio_clipping": 0.02,
    "text_readability": 0.92
  },
  "suggested_action": "Increase key light by 20%, check color temperature"
}
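
Per-frame scores only become review points once something decides which ones are worth a reviewer's attention. One simple way to do that is a fixed threshold, sketched below. The 0.7 cutoff and the `flag_defects` name are assumptions for illustration, since the post does not say how scores are turned into suggestions.

```python
def flag_defects(frame_result: dict, threshold: float = 0.7) -> list[tuple[str, float]]:
    """Return (defect, score) pairs at or above the threshold, worst first."""
    flagged = [
        (name, score)
        for name, score in frame_result["defects"].items()
        if score >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# The example output shown above.
example = {
    "frame": 1234,
    "timestamp": "00:01:30.5",
    "defects": {
        "lighting": 0.78,
        "color_grading": 0.45,
        "audio_clipping": 0.02,
        "text_readability": 0.92,
    },
}

print(flag_defects(example))
```

For the example frame this flags text readability (0.92) first and lighting (0.78) second. Color grading and audio clipping fall below the cutoff.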

Performance in Production

Accuracy Metrics

| Task | Accuracy | Speed | Cost |
| --- | --- | --- | --- |
| Transcription (Whisper) | 94% on clean speech | 15 sec for 5-min video | $0.006 per video |
| Feedback Synthesis (GPT-4) | 92% consistency with human summaries | 45 sec for 20 comments | $0.15 per synthesis |
| Defect Detection (Custom) | 87% precision on key issues | Real-time | Free (on-device) |

Real-World Impact

Before Shapy AI:

After Shapy AI:

Translation: 67% faster cycles, 7% more catches.

Handling Edge Cases

Problem 1: Accent & Dialect Accuracy

Whisper was trained primarily on English (roughly two-thirds of its 680K training hours), so transcripts of non-native speakers are less accurate.

Solution: Fine-tune Whisper on client vocabulary.

def finetune_whisper_for_domain(training_data: list[dict]):
    """
    training_data = [
        {
            'audio': audio_bytes,
            'transcript': 'YouViCo platform enables real-time collaboration'
        }
    ]
    """
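    # Intended flow via OpenAI's fine-tuning API. Check that the endpoint
    # accepts 'whisper-1' before relying on this; if it does not, the same
    # data can be used to fine-tune the open-source Whisper weights instead.
    # prepare_jsonl is a helper that is not shown here.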
    training_file = client.files.create(
        file=prepare_jsonl(training_data),
        purpose='fine-tune'
    )
    
    fine_tuned = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model='whisper-1'
    )
    
    return fine_tuned

For YouViCo’s user base, domain-specific fine-tuning improved accuracy from 87% to 94%.
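
The post does not say how transcription accuracy is measured. A common convention is word accuracy, defined as 1 minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the model output, divided by the reference length. Here is a minimal sketch under that assumption. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference gives a WER of 0.25, which is 75% word accuracy. Brand names such as "YouViCo" tend to be over-represented among the errors, which is why domain vocabulary matters.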

Problem 2: Hallucinations in Synthesis

GPT-4 sometimes invents feedback that no one said. Example: “Reduce dialogue volume” when the actual comment was “Audio sounds good, but check the background music.”

Solution: Fact-check against original comments.

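def validate_synthesis(original_comments: list[str], synthesis: dict) -> dict:
    # similarity_score(a, b) -> float in [0, 1] is assumed to be defined elsewhere
    validated = []
    for action_item in synthesis['action_items']:
        # Action items arrive as plain strings; wrap them so flags can be attached
        item = {'text': action_item} if isinstance(action_item, str) else dict(action_item)

        # Check if any original comment matches this action
        matches = [
            c for c in original_comments
            if similarity_score(item['text'], c) > 0.7
        ]

        if not matches:
            # Action item is unsupported by comments, mark as uncertain
            item['uncertain'] = True
            item['confidence'] = 0.5

        validated.append(item)

    synthesis['action_items'] = validated
    return synthesis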

Fallback: always show original comments alongside AI summary.
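
The validation snippet calls `similarity_score` without defining it. Below is a minimal stand-in that uses only the standard library. It compares characters, not meaning, so it will miss paraphrases, and an embedding-based score would likely work better in practice. Treat both function names and the 0.7 threshold as illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity_score(a: str, b: str) -> float:
    """Return a character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_supported(action_item: str, comments: list[str], threshold: float = 0.7) -> bool:
    """True if at least one original comment closely matches the action item."""
    return any(similarity_score(action_item, c) > threshold for c in comments)
```

With the hallucination example above, `is_supported("Reduce dialogue volume", ["Audio sounds good, but check the background music."])` returns `False`, so that action item would be marked uncertain.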

Problem 3: Expensive API Calls

Each video synthesis costs $0.15 in GPT-4 tokens. At 1,000 videos/month, that’s $150. Scaling to 10,000 = $1,500/month.

Solution: Caching + smaller models.

  1. Cache feedback synthesis for identical comment sets
  2. Use smaller models (GPT-3.5) for routine tasks
  3. Only use GPT-4 for complex synthesis

import hashlib
import json

def get_synthesis(comments: list[str], use_cache=True):
    # Built-in hash() is salted per process for strings, so use a stable
    # digest; sorting makes the key independent of comment order
    key_source = json.dumps(sorted(set(comments)))
    comment_hash = hashlib.sha256(key_source.encode('utf-8')).hexdigest()
    
    # Check cache (cache is assumed to be a shared key-value client)
    if use_cache:
        cached = cache.get(comment_hash)
        if cached is not None:
            return cached
    
    # For simple cases (< 5 comments), use cheaper GPT-3.5
    if len(comments) < 5:
        model = 'gpt-3.5-turbo'  # $0.01 instead of $0.15
    else:
        model = 'gpt-4'
    
    response = client.chat.completions.create(
        model=model,
        # ... (messages and other arguments elided)
    )
    
    synthesis = json.loads(response.choices[0].message.content)
    
    # Cache for next time (30 days)
    if use_cache:
        cache.set(comment_hash, synthesis, ttl=30*24*3600)
    
    return synthesis

Result: API costs reduced from $0.15 to $0.04 per video.
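
The blended figure depends on two things: how often the cache hits, and how many of the remaining requests are simple enough for the cheaper model. The post states neither rate, so the numbers in this sketch are purely illustrative assumptions. Only the per-call prices ($0.15 and $0.01) come from the text above.

```python
def expected_cost_per_video(
    cache_hit_rate: float,
    simple_fraction: float,
    cost_large: float = 0.15,  # GPT-4 synthesis, from the post
    cost_small: float = 0.01,  # GPT-3.5 synthesis, from the post
) -> float:
    """Expected API cost per video; cache hits are treated as free."""
    for name, value in (("cache_hit_rate", cache_hit_rate),
                        ("simple_fraction", simple_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    cost_on_miss = simple_fraction * cost_small + (1 - simple_fraction) * cost_large
    return (1 - cache_hit_rate) * cost_on_miss
```

For instance, a 50% cache hit rate combined with half of the misses going to the smaller model comes to about $0.04 per video. Many other combinations reach the same figure, so this shows how the levers interact and says nothing about the actual production mix.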

User Feedback

Teams using Shapy AI report:

We’re iterating on the UX, in particular on making suggestions easier to accept or ignore.

Cost-Benefit Analysis

| Cost | Benefit |
| --- | --- |
| Whisper API: $0.006/video | Auto-transcription: saves 10 min/video |
| GPT-4 synthesis: $0.04/video | Action items summary: saves 2 hours/project |
| Infrastructure: $50/month | Defect detection: catches 7% more issues |
| Total: ~$0.05/video | ROI: 40x in time savings per video |

At scale, Shapy costs less than 1% of the value it generates.

Lessons Learned

  1. Task decomposition: Transcription, Summarization, Detection are separate problems. Solve each independently.

  2. Hybrid AI: Combine best-of-breed models (Whisper for audio, GPT-4 for NLP, a custom ONNX model for vision).

  3. Human review is essential: Always show AI outputs alongside human input. Never auto-apply suggestions.

  4. Edge cases matter: 5% of videos are non-English or heavily accented. Invest in handling them.

  5. Cost control is critical: At $0.15/video, Shapy was unscalable. Optimization (caching, model selection) was necessary.

Shapy AI represents the future of video collaboration: not replacing human judgment, but augmenting it with intelligent assistance.
