TubeKit
A YouTube Crawling Toolkit

Step-by-Step Guide

Getting Started

  1. Download TubeKit.
  2. Make sure you have the other tools downloaded and configured as listed on the download page. TubeKit has been tested on Linux and Mac, and it should work fine on any other UNIX-based system.
  3. Move the downloaded zip file to a place where your web-server can see it. Unzip the file; this should create a directory called 'TubeKit'. (Example terminal commands for this step and the next are shown after this list.)
  4. Create a subdirectory in the 'TubeKit' directory named after your project. For instance, if you want to create a crawler for travel-related videos, you might create a subdirectory called 'Travel'. Make sure your web-server has write permission to the project directory.
  5. Point your browser to the 'TubeKit' directory through the web-server. With a standard configuration, this could be http://127.0.0.1/TubeKit. Ask your system administrator if you are unsure how to use the web-server.
  6. Click on 'Setup' and fill in the details. Descriptions of these fields are given below.
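
If you prefer to perform steps 3 and 4 from a terminal, the commands could look roughly like the sketch below. The download location, the web root /home/public_html, and the web-server user name 'www-data' are assumptions; substitute the values for your own system.

    mv ~/Downloads/TubeKit.zip /home/public_html/   # move the zip under the web root
    cd /home/public_html
    unzip TubeKit.zip                               # creates the 'TubeKit' directory
    mkdir TubeKit/Travel                            # project subdirectory
    sudo chown www-data TubeKit/Travel              # give the web-server user write access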

Basic Configuration

  1. Specify a name for your project. Make sure it is the same as the name of the directory you created for your project. In the example above, this would be 'Travel' (watch the case).
  2. Give a prefix. This will be used to name the generated files. For the travel project, this could be 'tr'.
  3. Give the path where the crawler should be stored. For the example above, this could be /home/public_html/TubeKit/Travel. DO NOT include '/' at the end. TubeKit will create two subdirectories there: one for storing Flash videos and one for MPEG videos. (A quick check of this path is shown after the list.)
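
A quick sanity check of the path (the value below is the example; note there is no trailing '/'). What matters is that the web-server user can write there, so look at the ownership and permissions the listing reports:

    ls -ld /home/public_html/TubeKit/Travel   # confirm the directory exists and check its permissions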

Database Setup

  1. Provide the host name. If the database is on your local machine, this would be 'localhost'.
  2. Give a name for your database. For instance, 'travel'.
  3. Provide the username for your MySQL account.
  4. Provide the password for your MySQL account. (A quick way to verify these credentials is shown after the list.)
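
Before running the setup, you can verify the credentials from a terminal. This is only a sketch: 'travel_user' is a placeholder for your own MySQL account, and the account needs privileges to create the database named in step 2.

    mysql -h localhost -u travel_user -p -e "SELECT VERSION();"   # prompts for the password; any result means the login works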

YouTube Setup

Select from 17 different attributes to crawl for each video. For each attribute, you can choose whether it is collected only once (the first time the video is crawled), every time, or never.

Crawling Setup

Specify when each of the components should be executed on a regular basis. If something is not relevant, put '*'. A brief description of these components is given below.

  1. Execute queries (component-1): run the seed queries on YouTube and collect the top 100 results. For each new video, extract the attributes that are marked to be collected only once.
  2. Crawl (component-2): go through the list of videos returned by the queries, go back to YouTube, and extract the attributes marked 'Every time' for those videos.
  3. Download videos in Flash format (component-3): YouTube serves its videos in Flash format. We use a third-party tool called 'youtube-dl' to download them.
  4. Convert videos to MPEG (component-4): it may be helpful to have the videos in a more standardized format such as MPEG. This component does exactly that using 'ffmpeg'. (Example commands for components 3 and 4 are shown after this list.)
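
To give a feel for what components 3 and 4 do, here are roughly equivalent standalone commands. This is only a sketch: the video URL and file names are placeholders, and TubeKit's own generated scripts handle this for you.

    youtube-dl -o video.flv "https://www.youtube.com/watch?v=VIDEO_ID"   # component-3: download the Flash video
    ffmpeg -i video.flv video.mpg                                        # component-4: convert it to MPEG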

Note: typically, component-1 should run first. It should take a few minutes, depending on the number of queries. Running component-2 after component-1 has finished is good practice, so that you have information about all the new results for your queries. Components 3 and 4 can be run at any time. Scheduling is optional; if you prefer, you can run all the components manually.
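
As an illustration of that ordering, a crawl schedule could look like the following crontab lines. The script names, and the assumption that they are run with the PHP command-line interpreter, are only illustrative; use the scripts that TubeKit generates in your project directory.

    # component-1 at 01:00 and component-2 an hour later, every day
    0 1 * * * php /home/public_html/TubeKit/Travel/component1.php
    0 2 * * * php /home/public_html/TubeKit/Travel/component2.php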

What happens next?

Once you click 'Run Setup', the following things happen.

  1. Configuration parameters are stored in project_directory/config.php.
  2. Other scripts are created and stored in the project directory.
  3. Database is created.
  4. Tables for queries, crawl-once, and crawl-everytime are created.
  5. A file is created that lists your cron jobs (scheduled events on a UNIX system). Use this file to schedule your cron jobs (the 'crontab cron_file_name' command) or append its contents to your existing cron jobs (using the 'crontab -e' command). (An example of installing it is shown after this list.)
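
For example, assuming the generated file sits in the project directory (the file name below is only the placeholder used above):

    crontab /home/public_html/TubeKit/Travel/cron_file_name   # install the generated schedule
    crontab -l                                                # list the cron jobs now installed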

You should now be able to browse your crawler by pointing your browser to that subdirectory. In our example, this would be http://127.0.0.1/TubeKit/Travel. If you want to edit the cron jobs later, use the 'crontab -e' command.

Click on the 'Queries' link and add seed queries in the interface that appears. These queries will then be monitored on the schedule you configured. If you want to edit the queries, log in to your MySQL database directly and access the table 'prefix_queries'.
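
For instance, with the example database 'travel' and prefix 'tr', inspecting the queries table from a terminal could look like this. The exact table and column names depend on your setup, so check them with DESCRIBE before changing anything.

    mysql -u travel_user -p travel -e "DESCRIBE tr_queries;"        # see the table's columns
    mysql -u travel_user -p travel -e "SELECT * FROM tr_queries;"   # list the current seed queries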
