MKV tag scraper

gregorv Offline
Junior Member
Posts: 6
Joined: Jun 2014
Reputation: 0
Post: #1
I have tagged all my movies, so there is actually no need to fetch any movie details online. As there is (apparently) no easy way to read such tags directly, I created a PHP script that builds the results and details XML sections, which a simple scraper can then read.
The PHP script reads all tags from a Matroska container and also looks at the tags under the different 'TargetTypeValue' sections. It builds arrays from the 'TargetTypeValue' levels 70 (series), 60 (seasons), 50 (episodes) and 40 (movie parts), and from these it creates separate collections for series, collections and multipart movies. Currently series with seasons and episodes are just collected under a collection name like "name of the series - Serie", so the real TVShow handling is still a ToDo. The posters are also read from the Matroska container and added to the library. The scraper works completely independently of your movie file names and does not require any additional configuration in advancedsettings.xml; it retrieves all information about collections from the tags.
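To make the level handling more concrete, here is a minimal standalone sketch of the grouping step. The sample XML, the titles and the helper name `collectTagsByLevel` are made up for illustration; the real script gets its XML from mkvextract and uses its own ReadTags function, so treat this only as an approximation of the idea.

```php
<?php
// Hypothetical sample in the shape of `mkvextract tags` output for one episode.
$sampleXml = <<<XML
<Tags>
  <Tag>
    <Targets><TargetTypeValue>70</TargetTypeValue></Targets>
    <Simple><Name>TITLE</Name><String>Example Series</String></Simple>
  </Tag>
  <Tag>
    <Targets><TargetTypeValue>60</TargetTypeValue></Targets>
    <Simple><Name>PART_NUMBER</Name><String>1</String></Simple>
  </Tag>
  <Tag>
    <Targets><TargetTypeValue>50</TargetTypeValue></Targets>
    <Simple><Name>TITLE</Name><String>Pilot</String></Simple>
    <Simple><Name>PART_NUMBER</Name><String>3</String></Simple>
  </Tag>
</Tags>
XML;

// Illustrative helper (not part of the script): maps each TargetTypeValue
// to an array of tag name => tag value.
function collectTagsByLevel(string $xmlString): array
{
    $byLevel = [];
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return $byLevel;                       // unparsable input: no tags
    }
    foreach ($xml->Tag as $tag) {
        $level = (int) $tag->Targets->TargetTypeValue;
        foreach ($tag->Simple as $simple) {
            $byLevel[$level][(string) $simple->Name] = (string) $simple->String;
        }
    }
    return $byLevel;
}

$tags = collectTagsByLevel($sampleXml);

// Levels 70, 60 and 50 are all present, so this is treated as a series:
// the entry is named "season-episode: title" and grouped under the series.
$movieName = $tags[60]['PART_NUMBER'] . '-' . $tags[50]['PART_NUMBER'] . ': ' . $tags[50]['TITLE'];
$set = $tags[70]['TITLE'] . ' - Serie';
echo $movieName, ' / ', $set, "\n";
```

Unlike the script, this sketch overwrites repeated tag names such as ACTOR instead of numbering them, which keeps the example short.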

Requirements:
A web server with PHP on a Linux machine (included as standard with most Linux distributions; it does not have to be the machine where XBMC is running) and mkvextract from MKVToolNix.
Create a folder in /var/www/html, e.g. 'movies', with a subfolder 'thumbs', and copy the following script into the folder 'movies' as read_mkv_details.php.
PHP Code:
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<?php
// Scraper for tagged MKV files, Gregor Verweyen.
// This script extracts the MKV tags and a cover from MKV files and creates the XML output for the scraper.
// It also creates collections, aggregates multipart movies... For several values of the same type (like actors) the MKV file can either
// have multiple tags or a comma separated list in one tag field.
// This scraper does not rely on any file naming conventions or any settings in advancedsettings.xml, e.g. for collections; such information is retrieved from the tags only.
// ToDo: currently it just creates collections from seasons with episodes - need to dig out how this tvshow mechanism works - any help appreciated, thanks

// some variables that you can change for your needs
$spath="/movie";                                           // path where to search movie files
$server_url="http://vnet-centos.fritz.box/movies/";        // change this to your server and folder
$SeriesCollection=" - Serie";                              // some German collection names, can be translated, here for seasons with episodes
$MultipartCollection=" - Mehrteiler";                      // here for movies with multiple parts movie part 1 and movie part 2
$SeriesNoOrder=" - Reihe";                                 // here for series with no specific order, e.g. Tatort - Reihe
$ext=".mkv";                                               // extension of movie files (needs to be matroska container)
$DebugFlag=false;                                          // set true to get debug output on web page


putenv("LANG=de_DE");                                      // required for language specific chars (ä, ö, ü...)
$title=$_GET["title"];                                     // get movie name from URL parameter 'title'
$title=rawurldecode($title);                               // replace URL escape codes (e.g. %20 for space)
$res_title=$title;                                         // save title, we need it later without quotes
$title = "\"$title\"";                                     // put name into quotes (note: it reaches the shell otherwise unescaped)
if($DebugFlag) print "<pre>search title: $title</pre>\n";  // check result if debug flag set to true
$file=explode("\n", `find $spath -iname $title$ext`);      // find movie file in folder & write each result into array
if($DebugFlag) print "<pre>file1: $file[0]</pre>\n";       // check result if debug flag set to true
$file[0] = "\"$file[0]\"";                                 // put name into quotes - we only use the first found file

$tmp = `env`;                                              // required for language specific chars (ä, ö, ü...)
                                                           // read all XML tags from movie file
$xmltags = `mkvextract --command-line-charset ISO-8859-1 tags $file[0] --output-charset UTF-8 --ui-language de_DE`;
$coverfile="thumbs/$title.jpg";                            // read cover and save it as thumbs/<title>.jpg
$jpgresult = `mkvextract --command-line-charset ISO-8859-1 attachments $file[0] 1:$coverfile`;
if($DebugFlag) print "<pre>all tags: $xmltags</pre>\n";    // check result if debug flag set to true

// We now build arrays for the Tag sections with series, season, episode and part tags. The array elements contain the
// tag names as keys and the tag values as values.
// For tag names that can occur several times (ACTOR, GENRE, DIRECTOR) we create keys with indexed names like ACTOR1, ACTOR2...
// There can be multiple Tag sections for TargetTypeValue 50: the user defined tags and the tags which contain the details
// for the first track (usually video), at least if the file was multiplexed with mkvmerge. We keep the first one that has a TITLE.

$Flag70=false; $Flag60=false; $Flag50=false; $Flag40=false; // no Tag section found yet
$xml = simplexml_load_string($xmltags);                    // prepare XML string for reading items
for ($i=0; $i<10; $i++) {                                  // loop through XML items and find values for TargetTypeValue
  switch("{$xml->Tag[$i]->Targets->TargetTypeValue}") {
    case "70":                                             // the file has series tags (for seasons (60) and episodes (50))
      $Series=ReadTags($xml, $i);                          // read series tags
      $Flag70=true;
      break;
    case "60":                                             // the file has season tags
      $Season=ReadTags($xml, $i);                          // read season tags
      $Flag60=true;
      break;
    case "50":                                             // the file has episode tags (main movie tags)
      if ($Flag50) break;                                  // if we have our tags already then skip (multiple 50 sections)
      $Episode=ReadTags($xml, $i);                         // read movie/episode tags
      if (strlen($Episode['TITLE']) > 0) $Flag50=true;     // we found the correct 50 section
      break;
    case "40":                                             // the file has movie part tags (main movie has multiple parts)
      $Part=ReadTags($xml, $i);                            // read movie part tags
      $Flag40=true;
      break;
    default:
      break;
  }
}
function ReadTags($xml, $TIx) {                            // function for reading tags, expects Tag index, returns tag array
  $ac=1; $gc=1; $dc=1;                                     // name appendix for multi tags (actors, genres and directors)
  $TagArr=array();                                         // result array: tag name => tag value
  for ($n=0; $n<50; $n++) {                                // loop through all Simple tags inside of a Tag section
    $tmp="{$xml->Tag[$TIx]->Simple[$n]->Name}";            // get name of the tag
    if ($tmp == "") break;                                 // if name is empty, leave loop
    if ($tmp == "ACTOR") {$tmp.=$ac; $ac++;};              // if name is ACTOR add appendix (ACTOR1, ACTOR2...)
    if ($tmp == "GENRE") {$tmp.=$gc; $gc++;};              // if name is GENRE add appendix (GENRE1, GENRE2...)
    if ($tmp == "DIRECTOR") {$tmp.=$dc; $dc++;};           // if name is DIRECTOR add appendix (DIRECTOR1, DIRECTOR2...)
    $TagArr[$tmp]="{$xml->Tag[$TIx]->Simple[$n]->String}"; // add key (Name) and value (String) to array
  }
  return $TagArr;                                          // return the array
}

if ($DebugFlag) {                                          // check result if debug flag set to true
  print "<pre>Series Tags:</pre>\n";
  print_r($Series);
  print "<pre>Season Tags:</pre>\n";
  print_r($Season);
  print "<pre>Movie Tags:</pre>\n";
  print_r($Episode);
  print "<pre>Movie Part Tags:</pre>\n";
  print_r($Part);
  print "<pre> </pre>\n";
}
function GetMultiTags($TagName, $TagArray) {               // function to concatenate multiple tags (like for actors)
  $Result="";                                              // initialize return value
  for ($i=1; $i<50; $i++) {                                // loop tag names 1-n
    $tmp=$TagArray[$TagName . $i];                         // create name of XML element (name+number)
    if ($tmp=="") break; elseif ($Result!="") $tmp = ", " . $tmp;  // end loop if no more tags, otherwise add ', ' separator if it is not the first
    $Result .= $tmp;                                       // create comma separated list
  }
  return $Result;
}

// here we start to assemble the details for the scraper

                                                           // here we build the result XML format with one entity for GetSearchResults
$the_url=$server_url . "read_mkv_details.php?title=" . rawurlencode($res_title);
$result="
<results>
  <entity>
    <title>$res_title</title>
    <language>de</language>
    <id>{$Series['IMDB']}</id>
    <url>$the_url</url>
  </entity>
</results>\n";
print $result;                                             // this is read from XBMC in scraper function GetSearchResults

// Now we put the details together. Series with seasons and episodes are (currently) created as a collection. Collections,
// film series (without specific order) and movies with multiple parts each get a different collection name.
// There are some functions to build the XML parts for multiple fields like actors, directors and genres.
$MovieName=$Episode['TITLE'];                            // may be changed in the following function
$set=CheckStructure();                                   // check collection, series, multi part...
if($DebugFlag) print "<pre>set value: $set</pre>\n";     // check result if debug flag set to true
                                                         // now the movie details
$details="
<details>
  <title>$MovieName</title>
  <originaltitle>{$Episode["ORIGINAL_TITLE"]}</originaltitle>
  <year>{$Episode["DATE_RELEASED"]}</year>
  <country>{$Episode["COUNTRY"]}</country>
  <id>{$Episode["IMDB"]}</id>
  <outline></outline>
  <top250>0</top250>
  <mpaa>{$Episode["LAW_RATING"]}</mpaa>\n  " .
  FormatMultiTags('DIRECTOR', $Episode) . "
  <tagline></tagline>
  <runtime>{$Episode["LENGTH"]}</runtime>
  <studio></studio>
  <trailer></trailer>
  <thumb>" . $server_url . "thumbs/" . rawurlencode($res_title) . ".jpg</thumb>
  <credits></credits>
  <rating>{$Episode["RATING"]}</rating>
  <votes></votes>\n  " .
  FormatMultiTags('GENRE', $Episode) . "\n  " .
  FormatMultiTags('ACTOR', $Episode) . "
  <outline></outline>
  <plot>{$Episode["SUMMARY"]}</plot>
  <status></status>
  <code></code>
  <set>$set</set>
  <file></file>
</details>\n";

print $details;                                          // this is read from XBMC in scraper function GetDetails

// For this function we assume that:
// - seasons and episodes have type 70, 60 and 50 sections with series titles, episode titles and
//   part numbers in the 60 (season) and 50 (episode) sections
// - collections have type 70 and 50 sections with a part number (in the 50 section)
// - film series (without specific order) have type 70 and 50 sections without a part number
// - movie parts have type 50 and 40 sections with a part number (in the 40 section)
function CheckStructure() {
  $set="";
  global $Flag70, $Flag60, $Flag50, $Flag40, $Series, $Season, $Episode, $MovieName, $Part, $SeriesCollection, $MultipartCollection, $SeriesNoOrder;
  if ($Flag50 and $Flag40) {                               // it is a movie with multiple parts (ToDo check > 1 episodes in one file)
    $set=$Episode['TITLE'] . $MultipartCollection;         // results in e.g. "Das Bernsteinzimmer - Mehrteiler" with Teil 1, Teil 2
    $MovieName=$Part['TITLE'];
  }
  if ($Flag70 and $Flag60 and $Flag50) {                   // it is a series with seasons and episodes
    $set=$Series['TITLE'] . $SeriesCollection;
    $MovieName=$Season['PART_NUMBER'] . "-" . $Episode['PART_NUMBER'] . ": " . $Episode['TITLE'];
  }
  if ($Flag70 and $Flag50 and !$Flag60) {                  // film series or collection
    if (strlen($Episode['PART_NUMBER']) > 0) {             // we have a collection
      $set=$Series['TITLE'] . " - Collection";             // results in e.g. Alien - Collection
    }
    else {                                                 // we have a film series
      $set=$Series['TITLE'] . $SeriesNoOrder;              // results in e.g. Tatort - Reihe
    }
  }
  return $set;
}


// This function creates XML for the multi tag fields director, actor and genre. The tags in the MKV file can either be a comma separated
// list in one tag or multiple tags. A list is stored in the array element ACTOR1; multiple tags are stored as e.g. ACTOR1, ACTOR2...
function FormatMultiTags($TagName, $TagArray) {            // function to build XML for multiple tags (like for actors)
  $Result="";                                              // initialize return value
  $first=isset($TagArray[$TagName . '1']) ? $TagArray[$TagName . '1'] : "";  // first (or only) value of this tag
  $pos = strpos($first, ',');                              // check if there is a comma separated list
  if ($pos === false) {                                    // if not, loop through all array elements that start with 'TagName'
    for ($i=1; $i<50; $i++) {                              // upper bound for the loop
      $tmp=isset($TagArray[$TagName . $i]) ? $TagArray[$TagName . $i] : "";  // loop tag names 1-n e.g. ACTOR1, ACTOR2...
      if ($tmp=="") break;                                 // end loop if no more values
      $Result.=AddTag($TagName,$tmp);                      // add the formatted tags (for each element) into result
    }
  }
  else {                                                   // make array from comma separated list
    $tmparr=explode(",", trim(str_replace(", ", ",", $first)));
    foreach ($tmparr as $tmp) {                            // for each element in the array list
      $Result.=AddTag($TagName,$tmp);                      // add the formatted tags (for each element) into result
    }
  }
  $Result=substr($Result, 2, strlen($Result) - 3);         // remove the first 2 spaces and the last new line char
  return $Result;
}
// helper function called from FormatMultiTags
function AddTag($TagName,$ItemName) {                      // returns formatted xml entries for the item given by TagName
  $Result="";                                              // stays empty for unknown tag names
  switch($TagName) {
    case "ACTOR":                                          // build the xml entry for an actor
      $Result="  <actor>\n    <name>$ItemName</name>\n  </actor>\n";
      break;
    case "DIRECTOR":                                       // build the xml entry for a director
      $Result="  <director>$ItemName</director>\n";
      break;
    case "GENRE":                                          // build the xml entry for a genre
      $Result="  <genre>$ItemName</genre>\n";
      break;
  }
  return $Result;
}

?>

The script needs read access to the hard disk(s) with the movie files and write access to the 'thumbs' subfolder.
Before you start, the variable $spath should point to the folder where your movies are, and $server_url should contain the address of your web server (including the 'movies' folder).
It can be tested with 'http://name_of_your_server/movies/read_mkv_details.php?title=any_movie_file_name' (the file name without the .mkv extension, because the script appends it), and it should show some details in the browser.
If you set $DebugFlag to true, it prints some more details. In any case, if you look at the HTML source you should see a results and a details section (XML format).

The scraper itself is quite simple: I created a folder 'metadata.mkv.tags' in addons (user area) and put the scraper files into it.
addon.xml:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<addon id="metadata.mkv.tags"
       name="MKV-TAGs"
       version="1.0.0"
       provider-name="Team XBMC + Gregor Verweyen">
  <requires>
    <import addon="xbmc.metadata" version="2.1.0"/>
  </requires>
  <extension point="xbmc.metadata.scraper.movies"
             language="de"
             library="mkvtags.xml"/>
  <extension point="xbmc.addon.metadata">
    <summary lang="de">MKV-TAG Film Scraper</summary>
    <summary lang="en">MKV-TAG Movie Scraper</summary>
    <summary lang="hu">MKV-TAG filmadat leolvasó</summary>
    <summary lang="kr">MKV-TAG 영화 스크래퍼</summary>
    <summary lang="se">MKV-TAG Filmskrapa</summary>
    <summary lang="nl">MKV-TAG Film Scraper</summary>
    <summary lang="pl">Scraper filmów MKV-TAG</summary>
    <summary lang="pt">Scraper de filmes MKV-TAG</summary>
    <description lang="de">Lade Filminformation von MKV-TAGs</description>
    <description lang="en">Download Movie information from MKV-TAGs</description>
    <description lang="hu">Film információk letöltése a MKV-TAGs webhelyről</description>
    <description lang="kr">MKV-TAGs 에서 영화 정보 다운로드</description>
    <description lang="se">Ladda ner filminformation från MKV-TAGs</description>
    <description lang="nl">Download film informatie van MKV-TAGs</description>
    <description lang="pl">Pobieraj informacje o filmach z MKV-TAGs</description>
    <description lang="pt">Descarregar informação de filmes de MKV-TAGs</description>
  </extension>
</addon>
and the file mkvtags.xml
Code:
<?xml version="1.0" encoding="UTF-8"?>
<scraper framework="1.1" date="2012-01-16">

    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression>(.*)</expression>
        </RegExp>
    </NfoUrl>
<!-- change the url, that it finds the php script on your server -->
    <CreateSearchUrl  dest="3">
        <RegExp input="$$1" output="&lt;url&gt;http://vnet-centos.fritz.box/movies/read_mkv_details.php?title=\1&lt;/url&gt;" dest="3">
            <expression noclean="1">(.*)</expression>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="4">
        <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot; standalone=&quot;yes&quot;?&gt;\1" dest="4">
            <expression noclean="1">(&lt;results&gt;.*&lt;/results&gt;)</expression>
        </RegExp>
    </GetSearchResults>

    <GetDetails dest="5">
        <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot; standalone=&quot;yes&quot;?&gt;\1" dest="5">
                <expression noclean="1">(&lt;details&gt;.*&lt;/details&gt;)</expression>
        </RegExp>
    </GetDetails>
</scraper>
Here, too, change the URL in the CreateSearchUrl function so that it points to your server.
To make it nicer, I added a little icon and of course a changelog.txt.
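As a rough illustration of what the GetSearchResults and GetDetails expressions pick out of the page, here is a small PHP sketch that applies the same two patterns to a made-up response. The helper name `extractSection` and the sample body are hypothetical, and letting "." match across line breaks (the "s" modifier) is an assumption about how the scraper engine evaluates the expression.

```php
<?php
// Made-up response body in the shape the PHP script prints.
$response = "<head></head>\n"
    . "<results>\n  <entity>\n    <title>Example</title>\n  </entity>\n</results>\n"
    . "<details>\n  <title>Example</title>\n  <set></set>\n</details>\n";

// Illustrative helper: returns the <name>...</name> block of the body, or null
// if there is none. The "s" modifier lets "." match newlines (an assumption
// about the scraper engine, see above).
function extractSection(string $body, string $name): ?string
{
    $quoted  = preg_quote($name, '~');
    $pattern = '~(<' . $quoted . '>.*</' . $quoted . '>)~s';
    return preg_match($pattern, $body, $m) === 1 ? $m[1] : null;
}

$results = extractSection($response, 'results');   // what GetSearchResults would keep
$details = extractSection($response, 'details');   // what GetDetails would keep
echo $results, "\n---\n", $details, "\n";
```

Because both sections are printed on the same page, one URL can serve the search step and the details step.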

Maybe someone can tell me how to attach files here; that would make it easier for you :)

Last but not least: I want to extend this to get more information for seasons and episodes, but so far I could not find out which XML sections a TVShow scraper needs. There is a mysterious episodeguide section pointing to a ZIP file, and that is where I am stuck, because I don't want to create that ZIP file.
Anyhow, at this stage the scraper already works in a way that is quite usable.

Have fun and if you can support me with the TVShow part - thanks.
By the way, in the logs I have seen that XBMC also reads tags with ffmpeg, but apparently it only collects the stream details and no film information.
If you want to tag MKV files: this is easy and fast because it does NOT require remultiplexing. Here is the command:
Code:
mkvpropedit "movie_file_name" -t global:"xml_file_name"
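For anyone who has not written such a tag file yet, the sketch below generates a small one in PHP. The element names (Tags, Tag, Targets, TargetTypeValue, Simple, Name, String) follow the Matroska tag XML layout that the script above reads; the helper name `buildTagXml`, the sample values and the output file name are only placeholders.

```php
<?php
// Illustrative helper: builds Matroska tag XML from an array of
// TargetTypeValue => [tag name => tag value].
function buildTagXml(array $levels): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $tags = $doc->appendChild($doc->createElement('Tags'));
    foreach ($levels as $typeValue => $simpleTags) {
        $tag = $tags->appendChild($doc->createElement('Tag'));
        $targets = $tag->appendChild($doc->createElement('Targets'));
        $targets->appendChild($doc->createElement('TargetTypeValue', (string) $typeValue));
        foreach ($simpleTags as $name => $value) {
            $simple = $tag->appendChild($doc->createElement('Simple'));
            // Text nodes take care of escaping characters like "&" and "<".
            $simple->appendChild($doc->createElement('Name'))
                   ->appendChild($doc->createTextNode($name));
            $simple->appendChild($doc->createElement('String'))
                   ->appendChild($doc->createTextNode($value));
        }
    }
    return $doc->saveXML();
}

// Placeholder values: a collection (70) with one numbered movie (50).
$tagXml = buildTagXml([
    70 => ['TITLE' => 'Alien'],
    50 => ['TITLE' => 'Aliens', 'PART_NUMBER' => '2', 'GENRE' => 'Science Fiction'],
]);

$tagFile = sys_get_temp_dir() . '/example_tags.xml';   // placeholder file name
file_put_contents($tagFile, $tagXml);
echo $tagXml;
```

The written file can then be passed as "xml_file_name" to the mkvpropedit command above.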
Quote:Edit: Changed the variables $SeriesCollection and $MultipartCollection so that the movie or series name comes first. Otherwise all series could only be found under 's' (for 'Series').
Quote:Edit: Changed the code to $the_url=$server_url . "read_mkv_details.php?title=" . rawurlencode($res_title);
because special characters like '#' also need to be URL-encoded (otherwise "#9 - Nach dem Ende unserer Welt fängt ihre Mission an.mkv" could not be scanned).
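A short sketch of why the encoding matters, using a shortened title and a placeholder host name (both are assumptions for the example): without encoding, everything from '#' onwards is treated as the URL fragment and never reaches the script as part of the title.

```php
<?php
$base  = 'http://example.invalid/movies/read_mkv_details.php?title=';  // placeholder host
$title = '#9';                                                         // shortened sample title

$plainUrl   = $base . $title;                 // "#" left as it is
$encodedUrl = $base . rawurlencode($title);   // "#" becomes %23

// Compare which query string each URL actually carries.
$plainQuery   = parse_url($plainUrl, PHP_URL_QUERY);
$encodedQuery = parse_url($encodedUrl, PHP_URL_QUERY);
echo "plain:   ", $plainQuery, "\n";
echo "encoded: ", $encodedQuery, "\n";
```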
(This post was last modified: 2014-07-17 08:54 by gregorv.)
gregorv Offline
Post: #2
Perhaps someone knows how to make XBMC NOT remove numbers from movie file names. I cannot scan Alien Nation - Spacecop LA 1991.mkv.
Thanks
Edit:
OK, found the solution myself :)
I added the following to advancedsettings.xml:
Code:
<video>
  <cleanstrings>
   <regexp>(.*)</regexp>
  </cleanstrings>
  <cleandatetime>(.*)</cleandatetime>
</video>
Now dates in the file name are passed on to the scraper search.
With this, a file name like M.A.S.H.mkv now works as well.
(This post was last modified: 2014-07-15 00:13 by gregorv.)
froidger Offline
Junior Member
Posts: 4
Joined: Sep 2014
Reputation: 0
Post: #3
Thanks for sharing!