HOW-TO write Media Info Scrapers - Scraper creation for dummies

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #16
well, you can just google for the xml-escape chars, it is general xml.
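
For reference, a minimal sketch of those escape characters using Python's standard `xml.sax.saxutils` module. The `pattern` string is a made-up example of a regular expression one might want to embed in a scraper XML file; the helper names `xml_escape` and `xml_unescape` are only illustrative.

```python
from xml.sax.saxutils import escape, unescape

# XML predefines five entities. escape() handles &, < and > by default;
# the two quote characters are passed in as extra replacements.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def xml_escape(text: str) -> str:
    """Escape text so it can be placed inside XML content or attributes."""
    return escape(text, QUOTES)


def xml_unescape(text: str) -> str:
    """Reverse xml_escape()."""
    return unescape(text, {entity: char for char, entity in QUOTES.items()})


# Hypothetical pattern, only here to have something to escape.
pattern = '<a href="/title/(tt[0-9]+)/">Tom & Jerry</a>'
escaped = xml_escape(pattern)
print(escaped)
print(xml_unescape(escaped) == pattern)
```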
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Changes to thumb handling for scrapes and NFO! ALL SCRAPER DEVELOPERS PLEASE READ!
Post: #17
hi guys,

i'm sorry that i have to do this to all of you, but it was necessary as the current state was just embarrassing.

please mind http://trac.xbmc.org/changeset/21882

i opted to not keep the old loading code as this would never die. one clean, painful cut.
UsagiYojimbo Offline
Member
Posts: 85
Joined: Feb 2010
Reputation: 1
Location: Debrecen, Hungary
Post: #18
Nicezia Wrote:for the record, when i scrape using scrap.exe the url returns 'Blah+blah+blah'; when i use xbmc the log reports that it's scraping for 'Blah%20blah%20blah'
Well, both URIs mean the same, as both %20 and the plus sign are evaluated to a space character...

"Now something totally different..."

Is there a chapter 3 planned, about scraping TV shows (TV series in particular)?
All the documentation deals with scraping movies, and I did not find any TV-show-related info.
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #19
UsagiYojimbo Wrote:Well, both URIs mean the same, as both %20 and the plus sign are evaluated to a space character...

No, some sites don't interpret + and %20 as meaning the same thing. As i found out with some scrapers i was writing, to some sites a + means that this exact word MUST exist as written, while %20 (space) allows for a fuzzy search.

ScraperXML Open Source Web Scraper Library compatible with XBMC XML Scrapers


I Suck, and if you act now by sending only $19.95 and a self addressed stamped envelop, so can you!

UsagiYojimbo Offline
Member
Posts: 85
Joined: Feb 2010
Reputation: 1
Location: Debrecen, Hungary
Post: #20
Nicezia Wrote:No, some sites don't interpret + and %20 as meaning the same thing. As i found out with some scrapers i was writing, to some sites a + means that this exact word MUST exist as written, while %20 (space) allows for a fuzzy search.
Well, see for yourself: HTML URL Encoding @ W3Schools
On the other hand, some scripts do not handle it properly... However, the URL/URI is correct.

BTW, about the plus sign having extra meaning, as you mention: that is possible if the plus sign arrives encoded as %2B. If that happens, it means your URL/URI is being encoded more than once... Try removing encoding steps until only one remains. (Well, in the case of a scraper, you could remove the encode attribute.)
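
A minimal sketch of the double-encoding effect, using Python's standard `urllib.parse` rather than anything from XBMC itself. The search title is taken from the example earlier in the thread; the variable names are only illustrative.

```python
from urllib.parse import quote_plus, unquote_plus

title = "Blah blah"

# Encoding once turns the space into a plus sign.
once = quote_plus(title)

# Encoding the already-encoded string again turns that plus into %2B.
twice = quote_plus(once)

# A server decodes the query value exactly once.
seen_after_single_encoding = unquote_plus(once)
seen_after_double_encoding = unquote_plus(twice)

print(once, "->", repr(seen_after_single_encoding))
print(twice, "->", repr(seen_after_double_encoding))
```

With a single round of encoding the server sees a space; with two rounds it sees a literal + between the words, which a search routine is then free to treat as an operator.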
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #21
UsagiYojimbo Wrote:Well, see for yourself: HTML URL Encoding @ W3Schools
On the other hand, some scripts do not handle it properly... However, the URL/URI is correct.

BTW, about the plus sign having extra meaning, as you mention: that is possible if the plus sign arrives encoded as %2B. If that happens, it means your URL/URI is being encoded more than once... Try removing encoding steps until only one remains. (Well, in the case of a scraper, you could remove the encode attribute.)

um... sure, if it was a simple matter of encoding, yes, they would mean the same, but what we're dealing with is the site's search routine and not simply the encoding.... apparently the site i am working with interprets a + as MUST exist and not as a space... it's the server-side interpretation, not the W3 consortium standard encoding, that is the matter... I understand what you're saying, but the interpretation of symbols is sometimes different in a website's search routine

(This post was last modified: 2010-07-09 05:13 by Nicezia.)
trondmm Offline
Junior Member
Posts: 4
Joined: Mar 2009
Reputation: 0
Post: #22
Nicezia Wrote:No some sites don't interpret + and %20 as meaning the same thing

That's because they don't mean the same thing everywhere.

%20 means space in the HTTP standard, while + means space in the CGI standard. So, as part of a key=value pair, after the ? in the URL, both %20 and + mean space. But before the ?, only %20 means space.
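
A minimal sketch of that path-versus-query difference, using Python's standard `urllib.parse`. The `example.com` URL is made up for illustration, and how a particular site's own search routine treats the decoded characters is a separate matter.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, urlsplit

title = "Blah blah blah"

# Two ways of encoding the same search title.
percent_style = quote(title)     # spaces become %20
plus_style = quote_plus(title)   # spaces become +

# Hypothetical URL with a plus in the path and both styles in the query.
url = "http://example.com/some+path/find?q=Blah+blah%20blah"
parts = urlsplit(url)

# Before the ?: only percent-escapes are decoded, a plus stays a plus.
path = unquote(parts.path)

# After the ?: form-style decoding turns both + and %20 into a space.
query = parse_qs(parts.query)

print(percent_style, plus_style)
print(path)
print(query)
```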
UsagiYojimbo Offline
Member
Posts: 85
Joined: Feb 2010
Reputation: 1
Location: Debrecen, Hungary
Post: #23
trondmm Wrote:%20 means space in the HTTP standard, while + means space in the CGI standard. So, as part of a key=value pair, after the ? in the URL, both %20 and + mean space. But before the ?, only %20 means space.
That is true, but it does not apply here, as the problem occurs in a search query, thus after the ? mark...
UsagiYojimbo Offline
Member
Posts: 85
Joined: Feb 2010
Reputation: 1
Location: Debrecen, Hungary
Post: #24
Nicezia Wrote:um... sure, if it was a simple matter of encoding, yes, they would mean the same, but what we're dealing with is the site's search routine and not simply the encoding.... apparently the site i am working with interprets a + as MUST exist and not as a space... it's the server-side interpretation, not the W3 consortium standard encoding, that is the matter... I understand what you're saying, but the interpretation of symbols is sometimes different in a website's search routine
Well, it must be encoding; otherwise the search engine of that site would not come across your + sign, just a space character...
Yukon150 Offline
Junior Member
Posts: 10
Joined: Nov 2011
Reputation: 0
Post: #25
Not sure if I'm in the right forum.

I have a website where I can watch live TV (makolive). Does anybody know how I can watch it on my ATV2?

Thanks -
LastCoder Offline
Fan
Posts: 434
Joined: Dec 2011
Reputation: 0
Post: #26
Hi,

Did something change with Eden? I read something in different threads about unresolved dependencies.

Is there a short way, meaning without having the whole of XBMC installed, to test scrapers on Linux like it's described for Windows (using scrap.exe and so on)?

Greetz

LastCoder

Ubuntu 12.04 LTS Server, Xfce, XBMC Gotham, Skin reFred, tvheadend tv backend
ASUS P8H61-M LE/USB3, Celeron G530, Geforce 210, 4 GB DDR3 RAM
16 GB CnMemory 300x CF, 1 TB Samsung 2,5" HDD
iHOS104 BluRay Drive, TT DVBS2-1600
Silverstone GD05B Case, Sony PS3 BD Remote control, Logitech Cordless Mediaboard Pro for PS3
freak1 Offline
Senior Member
Posts: 122
Joined: Jan 2012
Reputation: 1
Post: #27
Can somebody please create the site scraper for this website?
http://www.sakitvs.com/indiantv.htm

Thanks

G-Box Midnight with XBMC Midnight Linux v0.2b by Static
i5 iMac 21.5"
ATV 1
Acer Aspire Netbook
42" LG SMART TV
iPhone 5/xbmc
"Please don't forget to rate my reputation if you are satisfied with my help."

akuiraz Offline
Member
Posts: 77
Joined: Apr 2010
Reputation: 0
Post: #28
Is there a replacement for scrap.exe, as it seems to be retired? Anything that makes editing and testing the scraper along the way easier would be extremely helpful... I'm on Windows. Thanks in advance.

Spongeroberto Offline
Junior Member
Posts: 4
Joined: Apr 2012
Reputation: 0
Post: #29
Hello


I'm trying to create my own scraper but I'm stuck at the part where I put the scraper in XBMC. Putting it in the "C:\Program Files (x86)\XBMC\system\scrapers\video" folder doesn't make it appear in XBMC. Even if I just copy the 'dummy.xml' from the guide and put it in that folder, it doesn't show up, so it isn't an issue with my own little scraper.

Any ideas here?
DiMag Offline
Senior Member
Posts: 209
Joined: Nov 2012
Reputation: 0
Post: #30
The link to your tool for testing scrapers, given on page 1, is dead. As it is perhaps the most useful tool for any would-be scraper developer, can you update it or provide an alternative?