GSoC/GCI Archive
Google Summer of Code 2015

CCExtractor development

License: GNU General Public License version 2.0 (GPLv2)

Web Page: http://www.ccextractor.org/gsoc2015.html

Mailing List: https://lists.sourceforge.net/lists/listinfo/ccextractor-users

CCExtractor is the de facto standard tool for closed captioning and subtitles in the open source community.

CCExtractor takes almost any video file that contains subtitles and generates a clean transcript with in-sync timing.

While it's obviously a niche application, it's used in a number of scenarios:

- Universities such as UCLA use it to organize and index their enormous news archive.

- The largest media companies use it to convert the closed captioning of their TV shows to internet formats, since the FCC now requires that TV shows delivered via streaming be closed captioned too.

- Fan subtitle sites use it to feed their archives. If you've ever downloaded a subtitle file for a TV show, it most likely came from CCExtractor.

- News clipping companies use it extensively for TV trends analysis.

Our organization is focused on improving CCExtractor and keeping it open source. While our tool is conceptually small, it keeps growing because it's meant to support all kinds of subtitles from every corner of the world. Standards differ across Europe, South America, North America, Asia... and with the migration from analog to digital, each region has at least two. Then they can come in MP4, MPEG2, etc.

Because of the nature of our tool, many higher-level programs (such as HandBrake) use it. In turn, we took code from MythTV and others, and we optionally integrate with FFmpeg. Because of this, we expect lots of cross-project ideas.

We'd expect students to have an interest in one or more of these topics: subtitles, video, compression, networking, multithreading... and having an interest is really all it takes. It doesn't matter if you don't have the experience.

Projects

  • Adding translations to CCExtractor and Website for comparing statistics: Adding more features for users and making the tool more accessible. Automatic translation of the current output, which will help users get their favorite shows with subtitles in another language. A website to explore how the stock prices of certain companies depend on social outreach (Twitter and TV mentions).
  • CCExtractor GSOC 2015 proposal: For this proposal I picked five ideas from the ideas list (http://ccextractor.sourceforge.net/gsoc2015.html): 1. Real time translation with Google Translate API 2. Support of multiprogram streams 3. Finalizing CEA-708 support 4. Linux GUI Interface 5. Mac OS X GUI Interface
  • Complete support for EIA-708 and Implement Multi-Program Feature: Write test cases to verify that all options are working. Complete EIA-708 support for CCExtractor. Implement the multi-program feature of CCExtractor.
  • Constraint Based Speaker Diarization Module for Heterogeneous News data: The proposed framework will implement a speaker diarization module for a massive heterogeneous television news corpus, along with pre-processing steps that let users either supply their own segmentation information or use auto-segmentation. Another contribution is a speaker identification module within a network, using a technique called “Constrained Global Clustering” across video files. The framework will then be integrated with the open-source toolkit “voice-id” to use a multi-modal approach.
  • Improving the development lifecycle: An improvement of the development lifecycle for CCExtractor will help raise the quality of the code and of the implementation of several standards.
  • Multi-language Forced Alignment in a Heterogeneous Corpus: The current transcripts corresponding to the videos are imperfect (OOV words and lag). This project seeks to correct the transcripts by developing techniques to first detect errors in alignment and then produce correction algorithms to reduce the frequency of these errors. By combining different techniques, an accurate forced alignment package will be generated, able to operate under the adverse conditions found in both the transcript and the audio.
  • Proposal: Networking etc.