Kodi Community Forum

Full Version: How do you make a site scrapeable?
This might sound like a very uneducated question, but how, and with what markup language, should one write a website so that it is:
a) easy to scrape and well organized
b) still possible to style, whether with CSS or XSL or...

I don't need or expect someone to give me a complete tutorial on how to make it, just some pointers as to which markup I should look into.

I've tried looking at thetvdb.com and themoviedb.org, but I epically :p fail to understand with what it's been written.
both of those offer xml-based APIs.

the only thing needed to make a site scrapeable is a pattern that can be described using regular expressions. repeatability is the key..
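
A minimal sketch of what "a pattern that can be described using regular expressions" can look like in practice. The page fragment, class names, and the `scrape_games` helper are all hypothetical and only illustrate the idea that every entry repeats the same markup; this is not how any particular scraper is implemented.

```python
import re

# Hypothetical page fragment: every game entry repeats the same markup.
PAGE = """
<div class="game"><span class="title">Halo: Combat Evolved</span><span class="year">2001</span></div>
<div class="game"><span class="title">Halo 2</span><span class="year">2004</span></div>
"""

# One pattern describes every entry because the structure never varies.
GAME_PATTERN = re.compile(
    r'<span class="title">(?P<title>[^<]+)</span>'
    r'<span class="year">(?P<year>\d{4})</span>'
)


def scrape_games(page):
    """Return a (title, year) tuple for each match of the repeated pattern."""
    return [
        (match.group("title"), int(match.group("year")))
        for match in GAME_PATTERN.finditer(page)
    ]


print(scrape_games(PAGE))
```

If one entry put the year before the title, or used different class names, the pattern would silently miss it, which is why repeatability matters more than the choice of markup.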
Thank you for the swift reply.

OK, I've already tried to make a mock-up of the structure using XML (not final). So my follow-up question is simply: how does this look from a scraping viewpoint?
(Thinking about creating a gamedb)


Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="simple.xsl" ?>

<GAME_LIBRARY>

<GAME>
<GAME_ID>id of current game in list</GAME_ID>

<!-- Could possibly use the XBE ID tag? Or just a general ID to tag everything for easy scraping -->

<GAME_TITLE>
    <HEADER>Title</HEADER>
    <NAME>Halo: Combat Evolved</NAME>
    <IMG_URL>path to poster image</IMG_URL>
</GAME_TITLE>

<GAME_DEVELOPER>
    <HEADER>Developer(s)</HEADER>
    <NAME>Bungie Studios</NAME>
    <IMG_URL>path to logo</IMG_URL>
</GAME_DEVELOPER>

<GAME_PUBLISHER>
    <HEADER>Publisher(s)</HEADER>
    <NAME>Microsoft Game Studios</NAME>
    <IMG_URL>path to logo</IMG_URL>
</GAME_PUBLISHER>

<GAME_PLATFORM>
    <HEADER>Platform(s)</HEADER>
    <NAME>Xbox</NAME>
    <IMG_URL>path to logo</IMG_URL>
    <NAME>PC</NAME>
    <IMG_URL>path to logo</IMG_URL>
</GAME_PLATFORM>

<GAME_RELEASED>
    <HEADER>Released</HEADER>
    <YEAR>2001</YEAR>
    <MONTH>November</MONTH>
    <DATE>15</DATE>
</GAME_RELEASED>

<GAME_GENRE>
    <HEADER>Genre</HEADER>
    <SHORTHAND>FPS</SHORTHAND>
    <LONG>First-person Shooter</LONG>
</GAME_GENRE>

<GAME_SYNOPSIS>
    <HEADER>Synopsis</HEADER>
<SYNOPSIS>
Enter the mysterious world of Halo, an alien planet shaped like a ring. As mankind's super soldier Master Chief, you must uncover the secrets of Halo and fend off the attacking Covenant. During your missions, you'll battle on foot, in vehicles, inside, and outside with alien and human weaponry. Your objectives include attacking enemy outposts, raiding underground labs for advanced technology, rescuing fallen comrades, and sniping enemy forces. Halo also lets you battle three other players via intense split screen combat or fight cooperatively with a friend through the single-player missions.
</SYNOPSIS>
</GAME_SYNOPSIS>

<!-- COMMENT//Possibility of dynamically acquiring from gamerankings.com? -->

<GAME_RATING>
    <TEXT>95</TEXT>
    <IMG_URL>path to maybe generated img file</IMG_URL>
</GAME_RATING>

<!-- COMMENT//From gamerankings.com, percent rating from 0-100% -->

<GAME_VGCRS>
    <HEADER>Rated</HEADER>
    <ESRB>M</ESRB>
    <ESRB_URL>path to image of rating</ESRB_URL>
    <BBFC>15</BBFC>
    <BBFC_URL>path to image of rating</BBFC_URL>
    <PEGI>16+</PEGI>
    <PEGI_URL>path to image of rating</PEGI_URL>
    <USK>16</USK>
    <USK_URL>path to image of rating</USK_URL>
    <OFLC>MA15+</OFLC>
    <OFLC_URL>path to image of rating</OFLC_URL>
</GAME_VGCRS>

<GAME_VGCRS_DESC>
    <HEADER>not sure why this would be required</HEADER>
    <TEXT_1>Blood and gore</TEXT_1>
    <IMG_1>path to blood and gore image</IMG_1>
    <TEXT_2>Violence</TEXT_2>
    <IMG_2>path to violence image</IMG_2>
</GAME_VGCRS_DESC>
</GAME>

</GAME_LIBRARY>
my eyes hurt, please drop the upper case. don't see the point of the <header> entries, not that it matters since they can easily be skipped.

the platform tags should be xml'ized, i.e. just have multiple
<platform>
..
</platform>
<platform>
..
</platform>

instead of using img_url then name then img_url then name... much easier to parse and much more xml'ish.
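
A rough sketch of why the repeated `<platform>` layout is easier to parse, using Python's standard `xml.etree.ElementTree`. The lower-case tag names follow the suggestion above; the sample document, the image paths, and the `platforms` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical game entry using one <platform> element per platform.
GAME_XML = """
<game>
  <title>Halo: Combat Evolved</title>
  <platform>
    <name>Xbox</name>
    <img_url>path/to/xbox-logo.png</img_url>
  </platform>
  <platform>
    <name>PC</name>
    <img_url>path/to/pc-logo.png</img_url>
  </platform>
</game>
"""


def platforms(xml_text):
    """Return a (name, img_url) tuple for every <platform> element."""
    root = ET.fromstring(xml_text)
    return [
        (platform.findtext("name"), platform.findtext("img_url"))
        for platform in root.findall("platform")
    ]


print(platforms(GAME_XML))
```

With the flat name/img_url/name/img_url layout, the parser would have to pair siblings up by position; here each name and logo are already grouped under their own parent.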

brief overlook only mind you
dropping the upper case now..

The header tags are for display purposes; I thought it would be better if they were also scrapeable. But if they are unnecessary then by all means they will be cut out. No need to have data that isn't going to be scraped anyway.

(Now to just figure out how to display images in browser based on URL data only)

Going to review the platform tag and consequently the game_vgcrs tag for better formatting. Thank you!

Other than that we're good, this is scrapeable with ease?
yeah, xml is a piece of cake
Maybe for you. Mind you, I have never done anything with XML, so this is a first for me. I like a challenge though.

A final quick one, if I may: what formatting do you need to use so that the URL path in the XML is easy to scrape but also displays in the browser?
I can't wrap my head around it. It seems easy to make it scrapeable, but in the browser view it just displays the path, not the actual image (styling with CSS).
url format does not matter. you just need to remember that you are storing xml, so you need to escape special chars, in particular:

& -> &amp;
" -> &quot; (prob not relevant in a url)
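
A small sketch of the escaping described above, using `escape` from Python's standard `xml.sax.saxutils`. The `img_url` element name comes from the mock-up; the example URL and the `img_url_element` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def img_url_element(url):
    """Wrap a URL in an <img_url> element with XML special chars escaped."""
    # escape() replaces &, < and > by default; the extra mapping adds the
    # double quote, which is rarely needed in a URL.
    return "<img_url>" + escape(url, {'"': "&quot;"}) + "</img_url>"


# Hypothetical URL containing an ampersand in its query string.
raw_url = "http://example.com/poster.php?game=halo&size=large"
element = img_url_element(raw_url)
print(element)

# An XML parser turns &amp; back into &, so the scraper gets the raw URL.
print(ET.fromstring(element).text == raw_url)
```

The escaping only affects how the URL is stored in the XML text; anything that parses the XML properly sees the original URL again.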
I'll keep that in mind. Yes I'm aware that formatting doesn't matter when storing URL data, but it matters when I want to be able to display it in a browser as well. And that's what I'm having issues with currently, then again, I should probably ask somewhere else for that...
asphinx Wrote:I've tried looking at thetvdb.com and themoviedb.org, but I epically :p fail to understand with what it's been written.
Both thetvdb.com and themoviedb.org originally used the same open source website framework as their base and that website framework is available for anyone to download and use for free, see => http://sourceforge.net/projects/tvdb
Quote:TVDB - Online TV Database

A web/XML interface and database schema for managing TV series information and user-submitted graphics. Will be interfaced by a number of HTPC plugins and software. Currently used by plugins for Meedio, Media Portal, and XBox Media Center.

Why reinvent the wheel if you don't have to?
Excellent! I just downloaded 0.3 and am going to see whether it fits my purpose well enough to be something to build on and learn from!
Checked the tvdb framework, after my initial reaction of "what the hell" had settled, I started snooping around. But unfortunately as of now, I am nowhere near grasping the structural layout and inner workings that is thetvdb 0.3.

But I have been working on a prototype that might just do what's necessary anyway. I do however have a question (as I know next to nothing about scraping xml)

If say, an xml data container, contains the following php

Code:
<?php echo date("Y-m-j h:i:s A (T)", getlastmod()); ?>

does the scraper only read the data (as is) or is it possible to get the actual result of the echo instead, in this case the last modified date?
php is server side..
uuuh.. yes. I know that much. Doesn't really answer my question though, or I'm too dumb to realize that it does. But I am going to assume that the answer is no, and try to find an alternative solution. Thank you for the reply though.
asphinx Wrote:Checked the tvdb framework, after my initial reaction of "what the hell" had settled, I started snooping around. But unfortunately as of now, I am nowhere near grasping the structural layout and inner workings that is thetvdb 0.3.

But I have been working on a prototype that might just do what's necessary anyway. I do however have a question (as I know next to nothing about scraping xml)

If say, an xml data container, contains the following php

Code:
<?php echo date("Y-m-j h:i:s A (T)", getlastmod()); ?>

does the scraper only read the data (as is) or is it possible to get the actual result of the echo instead, in this case the last modified date?

0.3 is REALLY old. Sorry about that. We're in the middle of completely overhauling our database and moving it to Postgres (most likely) instead of MySQL, which will make most of the SQL exist only in stored procedures on the database. It should really simplify things.

You missed the point that spiff was making. That PHP code you see will get turned into a date on the server and returned. Client side users and applications won't see any of the PHP code itself.
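
To make that concrete, here is a sketch of what a client-side scraper might do with the response. It assumes the server has already run the PHP and wrapped the result in a hypothetical `<lastmod>` element; the sample response and the `parse_lastmod` helper are invented for illustration, with the date laid out as the `Y-m-j h:i:s A (T)` format string would produce it.

```python
import re
from datetime import datetime

# Hypothetical response body: the PHP has already been replaced by its output.
RESPONSE = "<lastmod>2009-03-7 09:05:12 PM (UTC)</lastmod>"

# Capture the timestamp and the timezone abbreviation in parentheses.
LASTMOD = re.compile(r"<lastmod>(.+?) \((\w+)\)</lastmod>")


def parse_lastmod(response):
    """Return (datetime, timezone abbreviation), or None if not found."""
    match = LASTMOD.search(response)
    if match is None:
        return None
    stamp, zone = match.groups()
    # 12-hour clock with AM/PM, matching PHP's "h:i:s A".
    return datetime.strptime(stamp, "%Y-%m-%d %I:%M:%S %p"), zone


print(parse_lastmod(RESPONSE))
```

The scraper never sees `<?php ... ?>`; it only ever receives the text the server sends back.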

I'd be happy to give you a bit of guidance for your project... just send me a PM. The main thing is that you'll really need a grasp of PHP and SQL (mysql or Postgres). Also keep in mind that once XBMC and Media Portal hit your site, you're looking at a TON of bandwidth and CPU usage so you'll really need a dedicated host when that time comes (instead of a shared host, which is what many website hosting companies provide).