2018-03-10, 17:36
Goals: Given a video, break it down into a series of component parts. Mark these parts using the chapters/bookmarks system.
Parts could include things like:
- Prologue
- Intro
- Part 1
- Titlecard
- Part 2
- Outro
- Credits
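Once parts like these are detected, they still have to be written back in a form Kodi can navigate. As one hedged sketch (assuming segments come out of the pipeline as `(start_seconds, end_seconds, title)` tuples), the detected parts could be emitted as an ffmpeg FFMETADATA chapter file and muxed into the video:

```python
def segments_to_ffmetadata(segments):
    """Render (start_s, end_s, title) segments as an ffmpeg FFMETADATA
    chapter file. The tuple format is an assumption about the pipeline's
    output, not something the project has settled on."""
    lines = [";FFMETADATA1"]
    for start, end, title in segments:
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",            # chapter timestamps in milliseconds
            f"START={int(start * 1000)}",
            f"END={int(end * 1000)}",
            f"title={title}",
        ]
    return "\n".join(lines) + "\n"
```

Muxing the chapters in would then be something like `ffmpeg -i video.mkv -i chapters.txt -map_metadata 1 -codec copy out.mkv`.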
Stretch goal: Build a platform making it easier to perform this type of analysis. This could be useful if someone wants to build a system to perform genre detection, for example.
Benefits: This could help anyone with video files that aren't properly segmented, making it easier to navigate around the video. A few use cases, for example:
- Skipping intros and outros
- Jumping to a song in a musical
What does it touch in Kodi:
- Reading the audio and video of the file
- Bookmarks and chapters
Requirements: Machine learning knowledge, Python and C++, and any computer with a decent Nvidia GPU (or a cloud equivalent).
Possible mentors: Razze
Methods: This project would be implemented through a deep learning model, specifically some combination of convolutional networks and recurrent networks. While the specific libraries and architectures need (much) more thought, here are a few initial suggestions:
Libraries (in order of preference):
- TensorFlow/Keras ((+) standard and easy to use; (-) slow, requires an extra dependency)
- PyTorch
- CNTK
Open design questions:
- The ratio of decisions made by the model versus by heuristics (our code).
- The required granularity and accuracy.
- The available data (copyright issues and all that).
- The number of classes.
Here's one basic end-to-end ML approach:
We model the task as classification, with the goal of assigning each moment of the video to one of a fixed set of classes.
The model is a basic recurrent (LSTM) model that, at each time step, takes a frame of video and a sample of audio as input and produces a softmax probability distribution over the set of classes, one class for each type of part.
The output is then smoothed (possibly with an HMM) to get the final breakdown.
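The smoothing and breakdown steps can be sketched without any ML framework. Below, a simple sliding-window mode filter stands in for the HMM, operating on a hypothetical `(T, C)` array of per-frame softmax outputs (the shape is an assumption about the LSTM's output):

```python
import numpy as np

def smooth_predictions(frame_probs, window=5):
    """Collapse per-frame class probabilities into a smoothed label sequence.

    frame_probs: (T, C) array standing in for the LSTM softmax output.
    A mode filter is used here; an HMM could replace it.
    """
    labels = frame_probs.argmax(axis=1)
    half = window // 2
    padded = np.pad(labels, half, mode="edge")
    return np.array([
        np.bincount(padded[t:t + window]).argmax()  # majority vote in window
        for t in range(len(labels))
    ])

def labels_to_segments(labels, fps=25.0):
    """Group runs of identical labels into (start_s, end_s, class) segments."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start / fps, t / fps, int(labels[start])))
            start = t
    return segments
```

A single misclassified frame inside an otherwise stable run is voted away by its neighbours, so the final breakdown does not fragment on per-frame noise.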
Here's a slightly less ML-heavy approach:
We model the task as an edge detection task, with the goal of identifying boundaries between different segments.
The model is again a recurrent network, but this time with a single output: the likelihood that the current moment is a segment boundary. The network is run across the video, producing a time series of probabilities. This series is smoothed, segment boundaries are extracted, and each resulting segment is then run through a (convolutional) classifier to determine its type.
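The boundary-extraction step in this variant might look like the sketch below, assuming the network yields a 1-D per-frame edge-probability series (the threshold and minimum-gap peak picking are placeholder heuristics, not a settled design):

```python
import numpy as np

def extract_boundaries(edge_probs, threshold=0.5, min_gap=10):
    """Turn a per-frame edge-probability series into boundary frame indices.

    edge_probs: 1-D array, one entry per frame (hypothetical network output).
    Nearby peaks within min_gap frames are merged into one boundary.
    """
    candidates = np.flatnonzero(edge_probs >= threshold)
    boundaries = []
    for c in candidates:
        if not boundaries or c - boundaries[-1] >= min_gap:
            boundaries.append(int(c))
    return boundaries

def boundaries_to_segments(boundaries, n_frames):
    """Split the frame range [0, n_frames) at the boundaries."""
    cuts = [0] + boundaries + [n_frames]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
```

The resulting `(start_frame, end_frame)` spans are what would then be cropped out and fed to the convolutional classifier.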
The first part of this project would mostly focus on identifying a suitable model and approach.