[GSOC-18 Proposal] Automated video segmentation
#1
Goals: Given a video, break it down into a series of component parts. Mark these parts using the chapters/bookmarks system.
Parts could include things like:
  • Prologue
  • Intro
  • Part 1
  • Titlecard
  • Part 2
  • Outro
  • Credits
Here, I'm operating on the assumption that these parts are fairly consistent across all shows, i.e. that 99% of shows follow the prologue -> intro -> part 1 -> titlecard -> part 2 -> outro pattern, or something similar (which seems true from what I've seen, but then again I only really watch anime, so what do I know).

Stretch goal: Build a platform making it easier to perform this type of analysis. This could be useful if someone wants to build a system to perform genre detection, for example.

Benefits: This could help anyone with video files that aren't properly segmented, making it easier to navigate within the video. A few example use cases:
  • Skipping intros and outros
  • Jumping to a song in a musical

What does it touch in Kodi:
  • Reading the audio and video of the file
  • Bookmarks and chapters
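As a rough sketch of the hand-off to the chapters system, suppose the model emits (start, end, label) segments in seconds; these could be converted into chapter marks. The timestamp format below is Matroska-style (HH:MM:SS.mmm) and purely illustrative; the actual Kodi-side interface would need investigation.

```python
def to_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (Matroska-style chapter timestamp)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def segments_to_chapters(segments):
    """Turn hypothetical (start_s, end_s, label) model output into
    (timestamp, chapter name) pairs, one chapter per detected segment."""
    return [(to_timestamp(start), label.title()) for start, end, label in segments]

chapters = segments_to_chapters([
    (0.0, 90.0, "prologue"),
    (90.0, 180.0, "intro"),
    (180.0, 1290.0, "part 1"),
])
for ts, name in chapters:
    print(ts, name)  # e.g. 00:01:30.000 Intro
```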

Requirements: Machine learning knowledge, Python and C++, and any computer with a decent Nvidia GPU (or a cloud equivalent).

Possible mentors: Razze

Methods: This project would be implemented through a deep learning model, specifically some combination of convolutional networks and recurrent networks. While the specific libraries and architectures need (much) more thought, here are a few initial suggestions:

Libraries (in order of preference):
  • TensorFlow/Keras ((+) standard and easy to use; (-) slow, requires an extra dependency)
  • PyTorch
  • CNTK
Approach: The primary factors in deciding on an approach are:
  • The ratio of decisions made by the model versus by heuristics (our code).
  • The required granularity and accuracy.
  • The available data (copyright issues and all that).
  • The number of classes.


Here's one basic end-to-end ML approach:

We model the task as classification, with the goal of assigning each moment to one of a set of classes.
The model is a basic recurrent (LSTM) model that, for each time step, takes in a frame of video and a sample of audio and produces a softmax probability distribution over a set of classes, one class per part type.
The output is then smoothed (possibly with an HMM) to get the final breakdown.
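The HMM smoothing step could look something like the following sketch: each part type is an HMM state, per-frame softmax outputs are the emission probabilities, and a hand-tuned "stay" probability penalizes label switches. The transition structure here is an assumption, not a tuned design; Viterbi decoding then recovers the most likely label sequence.

```python
import numpy as np

def viterbi_smooth(probs, stay=0.95):
    """Smooth per-frame class probabilities with a simple HMM:
    staying in the same class has probability `stay`, switching is
    uniform over the other classes. `probs` is a (T, K) array of
    softmax outputs; returns the most likely label sequence."""
    T, K = probs.shape
    trans = np.full((K, K), (1 - stay) / (K - 1))
    np.fill_diagonal(trans, stay)
    log_trans = np.log(trans)
    log_emit = np.log(probs + 1e-12)  # avoid log(0)

    delta = np.zeros((T, K))          # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)  # backpointers for path recovery
    delta[0] = log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # [prev state, cur state]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Noisy per-frame output: mostly class 0, one spurious frame favoring class 1.
noisy = np.array([[0.9, 0.1]] * 3 + [[0.4, 0.6]] + [[0.9, 0.1]] * 3)
print(viterbi_smooth(noisy))  # → [0 0 0 0 0 0 0], the outlier is smoothed away
```

The point of the smoothing is visible in the example: a raw per-frame argmax would flip to class 1 for the single noisy frame, while the HMM keeps the segmentation contiguous.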

Here's a slightly less ML-heavy approach:

We model the task as edge detection, with the goal of identifying boundaries between segments.
The model is again a recurrent network, but this time it has a single output: the likelihood that the current moment is an edge. The network is run across the video, producing a time series of probabilities. This series is smoothed, segment boundaries are extracted, and each segment is then run through a (convolutional) classifier to determine its type.
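The boundary-extraction step of this second approach could be as simple as peak picking on the probability series: keep local maxima above a threshold, suppressing peaks that fall too close to a stronger one. The threshold and minimum-gap values below are illustrative guesses, not tuned parameters.

```python
import numpy as np

def extract_boundaries(edge_probs, threshold=0.5, min_gap=30):
    """Pick segment boundaries from a per-frame edge-probability series:
    keep local maxima above `threshold` that are at least `min_gap`
    frames away from any stronger accepted peak."""
    candidates = [
        t for t in range(1, len(edge_probs) - 1)
        if edge_probs[t] >= threshold
        and edge_probs[t] >= edge_probs[t - 1]
        and edge_probs[t] >= edge_probs[t + 1]
    ]
    boundaries = []
    # Greedily accept peaks from strongest to weakest.
    for t in sorted(candidates, key=lambda t: -edge_probs[t]):
        if all(abs(t - b) >= min_gap for b in boundaries):
            boundaries.append(t)
    return sorted(boundaries)

probs = np.zeros(300)
probs[100] = 0.9   # strong edge
probs[102] = 0.7   # echo of the same cut, suppressed by min_gap
probs[250] = 0.8   # second edge
print(extract_boundaries(probs))  # → [100, 250]
```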


The first part of this project would mostly focus on identifying a suitable model and approach.
#2
Also, I'm currently working on getting my environment up and running. It's a little complicated since I'm using a non-standard OS (Solus). Honestly, I'm thinking of just giving up and building it in my Ubuntu VM using a shared folder.
#3
Nice to have you here :)

Sounds like a very interesting approach.

(2018-03-10, 19:28)hybrideagle Wrote: Also, I'm currently working on getting my environment up and running. It's a little complicated since I'm using a non-standard OS(Solus). Honestly, I'm thinking of just giving up and building it in my Ubuntu VM using a shared folder.
  
Totally valid approach and shouldn't hurt much :)
