Documentaries scraper?
#1
This should be another content type (along with movies, TV shows and music videos). There is a very nice compilation in wiki form at http://docuwiki.net/index.php?title=Special:Random. It is a spin-off of mvgroup, which is a very rich source of information, although its forum format makes it much more difficult to scrape, I think: http://forums.mvgroup.org/index.php?showtopic=10334
#2
show me the scraper and i'll add the content type after the feature freeze
#3
@pko66, check out the XBMC Online Manual on how to write a scraper (the IMDb scraper is also a good example):
http://wiki.xbmc.org/?title=Category:Scraper
...and feel free to add anything to the manual that you think is missing (it is a wiki after all)

Also see: http://forum.xbmc.org/showthread.php?tid=33710
#4
Here goes another of my way-too-long messages, sorry, please be patient...

I am learning scraper creation. As practice, I am modifying the "culturalia.es" scraper to enhance it a little, and I'm planning to add thumbnail grabbing (and maybe other info) from IMDb. The functionality will be constrained by a limitation of the current scraper implementation: if I'm not mistaken, you cannot have the user select a movie more than once. When you search culturalia you use the title as translated into Spanish, since you usually do not know the original title (sometimes they are VERY different, like "Sleepless in Seattle", which was called "Algo para recordar" here). So the user selects the movie from the culturalia results and, among other data, you now know the original title, which can in turn be used to search IMDb. But in that second search you cannot have the user choose the right movie, since he already chose once in the culturalia search. As a workaround, I plan to simply select the first result with the same title and year, but of course it can be the wrong one. Is there a better way to do it?
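
To make the workaround concrete, here is a rough sketch of the matching logic in Python rather than scraper XML, just to show the idea. The function names and the shape of the candidate list (dicts with `title` and `year`) are assumptions for illustration; they are not part of XBMC or of any IMDb interface.

```python
def normalize(title):
    """Lowercase the title and collapse runs of whitespace for comparison."""
    return " ".join(title.casefold().split())


def pick_match(original_title, year, candidates):
    """Return the first candidate with the same title and year, else None.

    `candidates` is a hypothetical list of search results, each a dict
    with 'title' and 'year' keys, in the order the search returned them.
    """
    wanted = normalize(original_title)
    for candidate in candidates:
        if normalize(candidate["title"]) == wanted and candidate["year"] == year:
            return candidate
    return None


# Made-up search results, for illustration only.
results = [
    {"title": "Sleepless in Seattle: Extras", "year": 1993},
    {"title": "Sleepless in  Seattle", "year": 1993},
]
best = pick_match("sleepless in seattle", 1993, results)
```

As noted, this can still pick the wrong entry when two results share both title and year, so returning `None` on no exact match is probably safer than falling back to the first result.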

As I learn, I'm writing a "manual", a kind of "scraper creation for dummies". Right now it is in Spanish, but I plan to translate it when it is finished. It could be a good addition to the wiki. If someone who knows Spanish wants to help me "foolproof" it, please tell me.

I suppose the IMDb search and the selection of the link can be implemented using a "custom function", but there is very little documentation. Also, studying the imdb.xml scraper, there are some calls to a $INFO function that I do not understand (they seem related to the settings specific to the imdb scraper, which I suppose are stored in videodb). Where can I find information specific to that?

BTW, I think that the imdb scraper is NOT a good place to start learning scraper creation; filmaffinity.xml is way simpler and so much more appropriate for beginners like me...
#5
correct on the limitation and how to best handle it currently (unless culturalia gives the imdbid).

$INFO[foo] is the value of the setting foo. you can use this to insert string values.
you can execute expressions conditionally using conditional="bar" where bar is a bool setting.
#6
Greetings.

I also have many documentaries and was interested in having a scraper for them. Regexp is somewhat familiar to me, so whacking together the main parser script wasn't too hard. There are a few issues that I don't understand, though.

Currently, scrap is reporting that my scraper returns the following for http://docuwiki.net/?title=Battleplan

Code:
<details><title>Battleplan</title>
<year>2006</year>
<plot><p>Battleplan is a military-based television documentary series examing the various military strategies used in modern warfare,
</p><p>since World War I. It is shown on the Military Channel and UKTV History.
</p><p>Each episode looks at a particular military strategy (or "battleplan") used in warfare, through two well-known historical
</p><p>examples and compares them both with the military requirements needed in order to conduct that "Battleplan". All the episodes
</p><p>use examples from modern warfare, dating from the First World War (1914–18) up to the recent Iraq War (2003).
</p></plot>
<actor><role>hosted</role><name>Eric Meyers</name></actor>
<genre>War</genre>
<episodeguide><episode><title>Blitzkrieg</title><season>1</season><epnum>1</epnum><id>1</id><plot><p>Examples used: Nazi Germany Blitzkrieg Campaign, Battle of France, Second World War and 2003 invasion of Iraq,Iraq War.
</p></plot></episode>
<episode><title>Assault From The Air</title><season>1</season><epnum>2</epnum><id>2</id><plot><p>Battle of Crete, Unternehmen Merkur, Second World War and Operation Junction City , Vietnam War.
</p></plot></episode>
<episode><title>Deception</title><season>1</season><epnum>3</epnum><id>3</id><plot><p>Examples used: Battle of Normandy, D-Day, Second World War and First Gulf War and 2003 invasion of Iraq, Iraq War.
</p></plot></episode>
<episode><title>Assault From The Sea</title><season>1</season><epnum>4</epnum><id>4</id><plot><p>Examples used: Battle of Inchon, Korean War and Battle of Iwo Jima, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Counterstrike</title><season>1</season><epnum>5</epnum><id>5</id><plot><p>Examples used: Yom Kippur War and Battle of Moscow, Second World War.
</p></plot></episode>
<episode><title>Blockade</title><season>1</season><epnum>6</epnum><id>6</id><plot><p>Examples used: Second Battle of the Atlantic, Second World War and US Submarine Campaign 1943-45, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Siege</title><season>1</season><epnum>7</epnum><id>7</id><plot><p>Examples used: Siege of Leningrad, Second World War and Battle of Dien Bien Phu, First Indochina War and Battle of Khe Sanh, Vietnam War.
</p></plot></episode>
<episode><title>Battlefleet</title><season>1</season><epnum>8</epnum><id>8</id><plot><p>Examples used: Battle of Midway, Pacific War, Second World War and Battle of Leyte Gulf, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Pre-Emptive Strike</title><season>1</season><epnum>9</epnum><id>9</id><plot><p>Examples used: Six Day War and attack on Pearl Harbor, Pacific War, Second World War.
</p></plot></episode>
<episode><title>Control of The Air</title><season>1</season><epnum>10</epnum><id>10</id><plot><p>Examples used: Battle of Britain, Second World War and First Gulf War.
</p></plot></episode>
<episode><title>Defensive Battle</title><season>1</season><epnum>11</epnum><id>11</id><plot><p>Examples used: Hindenburg Line, Western Front, First World War and Battle of Kursk, Second World War.
</p></plot></episode>
<episode><title>Guerilla Warfare</title><season>1</season><epnum>12</epnum><id>12</id><plot><p>Examples used: Mujahideen,Soviet war in Afghanistan and National Front for the Liberation of South Vietnam, a.k.a. Vietcong, Vietnam War.
</p></plot></episode>
<episode><title>Urban Warfare</title><season>1</season><epnum>13</epnum><id>13</id><plot><p>Examples used: Tet Offensive, Vietnam War and Battle of Stalingrad.
</p></plot></episode>
<episode><title>Breaking a Fortified Line</title><season>1</season><epnum>14</epnum><id>14</id><plot><p>Examples used: Hindenburg Line,Western Front, First World War and Second Battle of El Alamein, Second World War.
</p></plot></episode>
<episode><title>Raiding Operations</title><season>1</season><epnum>15</epnum><id>15</id><plot><p>Examples used: Unternehmen Eiche, recapture of Mussolini by Otto Skorzeny, Second World War and Operation Ivory Coast, Son Tay, Vietnam War.
</p></plot></episode>
<episode><title>Strategic Bombing</title><season>1</season><epnum>16</epnum><id>16</id><plot><p>Examples used: the RAF/USAAF campaign against Nazi Germany from 1941-45, bombing of Dresden and the USAAF assault on Japan in 1944-45, Bombing of Tokyo in World War II.
</p></plot></episode>
<episode><title>Flank Attack</title><season>1</season><epnum>17</epnum><id>17</id><plot><p>Examples used: Battle of Normandy, D-Day, Second World War and First Gulf War.
</p></plot></episode>
<episode><title>Special Operations</title><season>1</season><epnum>18</epnum><id>18</id><plot><p>Examples used: French Resistance, in the Second World War and 2003 invasion of Iraq, Iraq War.
</p></plot></episode>
</episodeguide>
</details>

Obviously the problem is that I have p tags all over the plots.
I've been able to strip the p tags out of the main plot with the following:
Code:
<RegExp input="$$4" output="\1" dest="6">
  <expression noclean="1">((&lt;p&gt;[^&lt;]+&lt;/p&gt;)+)</expression>
</RegExp>
<RegExp input="$$6" output="&lt;plot&gt;\1&lt;/plot&gt;\n" dest="5+">
  <expression ></expression>
</RegExp>

But I can't figure out how to strip the p tags out of the episode plots. The episode section is generated with the following:

Code:
<RegExp input="$$3" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;\n" dest="5+">
  <RegExp input="$$4" output="&lt;episode&gt;&lt;title&gt;\2&lt;/title&gt;&lt;season&gt;1&lt;/season&gt;&lt;epnum&gt;\1&lt;/epnum&gt;&lt;id&gt;\1&lt;/id&gt;&lt;plot&gt;\3&lt;/plot&gt;&lt;/episode&gt;\n" dest="3">
    <expression repeat="yes" noclean="1">&lt;span class=&quot;mw-headline&quot;&gt; ([0-9]+)\. ([-a-zA-Z0-9 ]+) &lt;/span&gt;&lt;/h3&gt;\n((&lt;p&gt;[^&lt;]+&lt;/p&gt;)+)</expression>
  </RegExp>
  <expression noclean="1" />
</RegExp>

$$4 contains all the HTML between the "Information" and "Screenshot" labels on the page. Does anyone more experienced with XBMC scrapers have an idea how to strip those p tags from the episode plots?
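
To show what the result should look like, here is a small Python sketch of the same regex logic (it is not scraper XML and says nothing about how XBMC itself evaluates expressions). It reuses the episode pattern from above with the paragraph group made non-capturing, then removes the p tags from each captured plot in a second pass. The names `parse_episodes` and `strip_p_tags` and the sample HTML are made up for illustration.

```python
import re

# Same episode pattern as in the scraper, but the repeated paragraph
# group is non-capturing and allows whitespace between paragraphs.
EPISODE_RE = re.compile(
    r'<span class="mw-headline"> ([0-9]+)\. ([-a-zA-Z0-9 ]+) </span></h3>\n'
    r'((?:<p>[^<]+</p>\s*)+)'
)
P_TAG_RE = re.compile(r'</?p>')


def strip_p_tags(html):
    """Replace <p> and </p> with spaces, then collapse all whitespace."""
    return " ".join(P_TAG_RE.sub(" ", html).split())


def parse_episodes(html):
    """Return a list of dicts with epnum, title and tag-free plot."""
    episodes = []
    for number, title, plot_html in EPISODE_RE.findall(html):
        episodes.append(
            {"epnum": int(number), "title": title, "plot": strip_p_tags(plot_html)}
        )
    return episodes


# Hypothetical sample shaped like the page section described above.
sample = (
    '<h3><span class="mw-headline"> 1. Blitzkrieg </span></h3>\n'
    '<p>Examples used: Battle of France.\n</p>\n'
    '<h3><span class="mw-headline"> 2. Assault From The Air </span></h3>\n'
    '<p>Battle of Crete.\n</p><p>Second paragraph.\n</p>'
)
episodes = parse_episodes(sample)
```

The point of the sketch is the two-pass shape: first capture the whole paragraph block per episode, then clean the tags out of that block separately.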


The other problem with documentaries: currently I have it as a tvshow type, but documentaries are usually named either Name.quality-ripinfo.avi or Name.1of10.EpisodeName.quality-ripinfo.avi. How can I get XBMC to detect 1of10 as season 1 episode 1, 4of10 as season 1 episode 4, etc.? And for single-part documentaries, do I still need an episodeguide section with a single episode?
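
To pin down the mapping I mean, here is a minimal Python sketch of the "NofM" parsing, assuming the part marker is always delimited by dots, spaces, underscores or dashes. The `parse_part` name and the default of season 1 are my own assumptions; where such a pattern would be configured in XBMC is a separate question and is not shown here.

```python
import re

# Matches ".4of18." style markers; group 1 is the part, group 2 the total.
PART_RE = re.compile(r'[._ -](\d+)of(\d+)[._ -]', re.IGNORECASE)


def parse_part(filename, default_season=1):
    """Return (season, episode) for 'Name.4of10.Title.avi', else None."""
    match = PART_RE.search(filename)
    if match is None:
        return None  # single-part documentary, no NofM marker
    return default_season, int(match.group(1))


multi = parse_part("Battleplan.4of18.Assault.From.The.Sea.quality-ripinfo.avi")
single = parse_part("Name.quality-ripinfo.avi")
```

Files without a marker come back as `None`, which is the single-part case asked about above.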

journey