docuwiki.com scraper in development

journey4712 Offline
Junior Member
Posts: 6
Joined: Oct 2008
Reputation: 0
Post: #1
Greetings.

I've started working on a docuwiki.com scraper. Even though it's a wiki, it has a fairly static structure that can be scraped most of the time. I've got the basics down (extracting the title/year/narrator and the episode titles/plots, if they exist). I'm running into an issue though: I'm unsure how the documentaries should be organized, since they sit somewhere between tvshows and movies.

They are like tvshows in that *some* of them come as multi-part series. They are like movies in that a good number are single-part only. What I'm not sure of is how they need to be organized so that xbmc scans the single-parters as single-part documentaries (movies, essentially) and the multi-parters as a show with episodes. Do I need to separate them into different source directories with different scrapers: a "movie" scraper for the single-part documentaries and a "tvshow" scraper for the multi-part ones? It would be essentially the same scraper, so that seems like a bad hack.
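To make the single-part versus multi-part split concrete, here is a minimal sketch in standalone Python. It is not part of any xbmc scraper API; the function name `split_by_kind` and the sample file names are hypothetical, and it assumes multi-part documentaries always carry an XXofYY marker in the file name.

```python
import re
from collections import defaultdict

# Assumed marker for multi-part documentaries: Name.XXofYY.rest
PART_RE = re.compile(r"^(?P<name>.+?)\.(?P<part>\d+)of(?P<total>\d+)\.", re.IGNORECASE)

def split_by_kind(filenames):
    """Sort file names into single-part ("movie") and multi-part ("tvshow") groups."""
    movies = []
    shows = defaultdict(list)
    for filename in filenames:
        match = PART_RE.match(filename)
        if match and int(match.group("total")) > 1:
            # Multi-part: group the parts under the shared documentary name.
            shows[match.group("name")].append(filename)
        else:
            # No part marker (or 1of1): treat as a single-part documentary.
            movies.append(filename)
    return movies, dict(shows)
```

The two groups could then be moved into separate source directories, which is the workaround the question describes rather than a fix for it.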

The next problem relates to advancedsettings.xml. When creating the regexp for <tvshowmatching>, it seems it needs to detect both the season and the episode from each name. The main problem is that documentaries don't have a season; for simplicity's sake my scraper currently outputs all episodes as part of season 1. Documentaries are usually named Name.XXofYY.EpisodeTitle.quality.ripgroup.avi. How can I recognize 2of3 or 5of9 as season 1 episode 2, or season 1 episode 5, based on those file names? It's escaping me because I need to capture a 1 for the season, but there is no "1" reliably present in the names.
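The mapping asked about here can be sketched outside xbmc in plain Python. This is an illustration of the logic only, not a <tvshowmatching> expression: it captures the episode from XXofYY and supplies the season as a fixed default instead of capturing it. The function name `parse_part` and the `default_season` parameter are hypothetical.

```python
import re

# Assumed layout: Name.XXofYY.EpisodeTitle.quality.ripgroup.avi
PART_RE = re.compile(r"^(?P<name>.+?)\.(?P<part>\d+)of(?P<total>\d+)\.", re.IGNORECASE)

def parse_part(filename, default_season=1):
    """Return show/season/episode info for a Name.XXofYY... file, or None."""
    match = PART_RE.match(filename)
    if match is None:
        return None
    return {
        "show": match.group("name"),
        # The file name has no season, so it is supplied, not captured.
        "season": default_season,
        "episode": int(match.group("part")),
        "total": int(match.group("total")),
    }

print(parse_part("Name.5of9.EpisodeTitle.quality.ripgroup.avi"))
# {'show': 'Name', 'season': 1, 'episode': 5, 'total': 9}
```

The point of the sketch is that the season has to come from somewhere other than a capture group, which is exactly what a pure two-group regexp cannot express.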

Basically, it seems documentaries don't fit in very well as either movies or tv shows; any pointers to getting this done would be much appreciated. Or should I be submitting a trac ticket for a third type of video file, so that there are movies, tvshows, and the new documentaries? I'm not particularly excited to submit a trac ticket for this though, because I imagine it could be a few months before anything solid happens (if ever) with respect to a new type of video file.

journey4712
journey4712 Offline
Junior Member
Posts: 6
Joined: Oct 2008
Reputation: 0
Post: #2
Another thing I forgot to mention regarding filenames. Documentaries cataloged at docuwiki use the following naming convention for extras:

Name.1of3.title.avi
Name.2of3.title.avi
Name.3of3.title.avi
Name.Extras.1of2.title.avi (or sometimes Name.Extras.1.title.avi)
Name.Extras.2of2.title.avi (or sometimes Name.Extras.2.title.avi)

This also creates complications in matching when trying to use 1of3 as season 1 ep 1, because the extras get seen as being part of season 1 as well. I'm starting to wonder: is there any way I can reference episodes in the scraper by title instead of by episode number? Every documentary I've looked at on docuwiki follows that same naming format (the filename is available on docuwiki), always including the title of the episode. In most cases docuwiki doesn't even show episode numbers with the episode descriptions, just the title of the episode and then the plot, followed by the next episode.
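A minimal sketch of keeping extras apart from regular parts, again in standalone Python rather than anything xbmc provides. It assumes the naming convention listed above; the function name `classify` is hypothetical, and the returned "title" is simply everything between the part marker and the file extension, so for real files it may still include quality and rip-group tags.

```python
import re

# Extras: Name.Extras.1of2.title.avi or Name.Extras.1.title.avi
EXTRA_RE = re.compile(
    r"^(?P<name>.+?)\.Extras\.(?P<part>\d+)(?:of(?P<total>\d+))?\.(?P<title>.+)\.[^.]+$",
    re.IGNORECASE,
)
# Regular parts: Name.1of3.title.avi
EPISODE_RE = re.compile(
    r"^(?P<name>.+?)\.(?P<part>\d+)of(?P<total>\d+)\.(?P<title>.+)\.[^.]+$",
    re.IGNORECASE,
)

def classify(filename):
    """Label a file as "extra" or "episode", or return None if neither fits."""
    # Extras are tested first so Name.Extras.1of2 is never read as an episode.
    match = EXTRA_RE.match(filename)
    kind = "extra"
    if match is None:
        match = EPISODE_RE.match(filename)
        kind = "episode"
    if match is None:
        return None
    return {
        "kind": kind,
        "show": match.group("name"),
        "part": int(match.group("part")),
        "title": match.group("title"),
    }
```

Checking the extras pattern before the episode pattern is the design choice that stops Name.Extras.1of2 from being counted as part 1 of a show called "Name.Extras".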

For example pages, look at A History of Britain or Battleplan. Unfortunately they aren't exactly the same: Battleplan uses episode numbers in the episode names, while A History of Britain (along with most of the site) does not.
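Matching by title rather than by number could look roughly like the sketch below. It is plain Python and assumes the scraper already has a list of episode titles from the wiki page; the names `normalize` and `match_by_title` and the sample titles are hypothetical. Because it checks whether the scraped title is contained in the file's title portion, a leading episode number in the file name does not get in the way.

```python
import re

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())

def match_by_title(file_title, scraped_titles):
    """Return the index of the longest scraped title found in file_title, or None."""
    key = normalize(file_title)
    best_index = None
    best_length = 0
    for index, title in enumerate(scraped_titles):
        candidate = normalize(title)
        # Prefer the longest match so short titles do not shadow longer ones.
        if candidate and candidate in key and len(candidate) > best_length:
            best_index = index
            best_length = len(candidate)
    return best_index

# Hypothetical titles as they might appear on a wiki page.
titles = ["The First Part", "Second Part: Aftermath"]
print(match_by_title("02.Second.Part.Aftermath.DVDRip.XviD", titles))  # 1
```

Substring matching on normalized text is a deliberately simple choice; pages whose episode titles differ from the file names by more than punctuation would need fuzzier comparison.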

journey4712
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #3
hi,

sorry for the late reply, this somehow slipped past my attention.

to support this properly we need to define some semantics and a new content type for documentaries. from what i can see we need added logic for
1) support 'movie' and 'tvshow' documentaries. this should be quite easy to add, it would just be some flags in the returned xml from GetSearchResults.
2) some additional rules for extras-naming. i guess this should be done in a general way to support movies and tvshows as well.
3) possibly support for scraping episodes by filenames - this one we want to avoid if possible (but certainly doable if we deem it necessary).

but in any case, we will have to return to this after atlantis. nag me then :)
voiddreamer Offline
Junior Member
Posts: 22
Joined: Nov 2008
Reputation: 1
Post: #4
I'm very interested in this and had thought of making my own. Any update on this?
Jyujinkai Offline
Senior Member
Posts: 191
Joined: Jul 2008
Reputation: 0
Post: #5
2008... guess this never happened?