Scraping helper tools for plugin coders
#1
Scraping
When writing plugins, part of your job is getting content from a website. If you're in luck, everything you need is nicely presented in one or more RSS feeds. If not, you'll have to extract the info you need from the pages of the website, which means diving into the HTML source. Fortunately there are some handy tools and add-ons available to help you find the data you want.

Firefox Add-ons
XPath Checker
To gather data from a webpage in your Python code, you use a lot of XPath expressions. With this add-on you can easily try out your expressions and check that they collect the right data before implementing them in your Python code.
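
As a rough illustration of that workflow, here is a minimal sketch of using an expression in Python once it has been checked in the browser. The page fragment, the class names and the function name are all made up for the example. It uses the standard library's xml.etree.ElementTree, which only understands a subset of XPath and needs well-formed markup, so real-world HTML will usually need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a real web page.
PAGE = """
<html>
  <body>
    <div class="episode">
      <a href="/video/1">Episode 1</a>
      <img src="/thumbs/1.jpg" />
    </div>
    <div class="episode">
      <a href="/video/2">Episode 2</a>
      <img src="/thumbs/2.jpg" />
    </div>
    <div class="advert">
      <a href="/ads/1">Buy now</a>
    </div>
  </body>
</html>
"""


def find_episodes(page):
    """Return (title, url, thumbnail) tuples for every episode block."""
    root = ET.fromstring(page.strip())
    episodes = []
    # The same kind of expression you would first try out in XPath Checker.
    for div in root.findall(".//div[@class='episode']"):
        link = div.find("a")
        thumb = div.find("img")
        episodes.append((link.text, link.get("href"), thumb.get("src")))
    return episodes


episodes = find_episodes(PAGE)
for title, url, thumb in episodes:
    print(title, url, thumb)
```

The advert block is skipped because the expression only selects div elements whose class is "episode".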

XPather
Requires: DOM inspector plugin
...generates XPaths while browsing or inspecting HTML/XML/*ML documents; evaluates your XPaths and inspects the results; extracts the content.

XPather is a simple Firefox extension that integrates with both the browser and its DOM Inspector. Thus, it's very lightweight and cross-platform. It is valuable mainly as a web/XML-app development and hacking tool.

HttpFox
HttpFox logs all HTTP connections and gives a nice overview of which files are being downloaded. This can help you find the paths or URLs of images and videos.

Webdeveloper Toolbar
This toolbar offers a *lot* of options. A couple of nice ones that might come in handy are:
  • Outline > Outline Current Element: draws a rectangle around the item under the mouse pointer and displays the path to the element in the document
  • Information > Display Element Information: gives you information about the element itself, but also about its parent and child elements
  • View Source > View Generated Source: Can't find an element you're looking for? Chances are the element was dynamically added to the webpage (for instance a Flash player that was put in on page load with the popular SWFObject JavaScript). With this option you can view the source after any JavaScript has changed it.
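
When an element only shows up in the generated source, the data used to build it is often still sitting in a script block of the raw source, where a regular expression can reach it. The following is a minimal sketch under that assumption. The page source, URLs and variable names are invented for the example, and it assumes an SWFObject 1.x style embed that passes its settings through addVariable().

```python
import re

# Hypothetical raw page source: the player is not in the HTML itself,
# only in the script that creates it on page load.
RAW_SOURCE = '''
<div id="player">You need Flash to watch this video.</div>
<script type="text/javascript">
  var so = new SWFObject("/flash/player.swf", "mpl", "480", "360", "9");
  so.addVariable("file", "http://example.com/media/clip42.flv");
  so.addVariable("image", "http://example.com/media/clip42.jpg");
  so.write("player");
</script>
'''

# Matches addVariable("name", "value") with single or double quotes.
ADD_VARIABLE = re.compile(
    r'''addVariable\(\s*["'](\w+)["']\s*,\s*["']([^"']+)["']\s*\)'''
)


def player_variables(source):
    """Collect the name/value pairs passed to addVariable() into a dict."""
    return dict(ADD_VARIABLE.findall(source))


variables = player_variables(RAW_SOURCE)
print(variables["file"])
```

Other embed scripts pass their settings differently, so the pattern has to be adapted to whatever the raw source of the page in question actually contains.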

Firebug & Firebug Extensions
Firebug integrates with Firefox to put a wealth of development tools at your fingertips while you browse. You can edit, debug, and monitor CSS, HTML, and JavaScript live in any web page. Firebug also has some nice extensions.

Fiddler Web Debugger
Fiddler is a Web Debugging Proxy which logs all HTTP(S) traffic between your computer and the Internet. Fiddler allows you to inspect all HTTP(S) traffic, set breakpoints, and "fiddle" with incoming or outgoing data. Fiddler includes a powerful event-based scripting subsystem, and can be extended using any .NET language.

Fiddler is freeware and can debug traffic from virtually any application, including Internet Explorer, Mozilla Firefox, Opera, and thousands more.

Internet Explorer Developer Toolbar
Built into IE 8. For IE 6 & 7 it is a separate download (see the links in post #7).
Developers have all of the tools that they need, right out of the box without the need to download or install anything. By simply hitting the F12 key to bring them up, developers have access to:
  • DOM manipulation & HTML tree view
  • CSS tracing
  • JavaScript debugging
  • JavaScript profiling

Source : Plex Wiki
#2
Thanks for the info. Live HTTP Headers, Firebug, and Wireshark are also pretty useful.
#3
Thanks for information on these tools.

I tried the Web Developer toolbar on the link below, which uses JavaScript:

http://www.rajshri.com/midpage.aspx?cntid=33663

With the regular source, going to pages 2, 3, etc. on the above link still gives only the source of the first page of episodes. Using the Web Developer toolbar (View Source > View Generated Source) gives me the source after the JavaScript has executed and shows the episodes listed on the other pages (2, 3, etc.), which is very nice. But how can we get the URL of each page to use in the match=re.compile('') commands? Or what would be the method to follow for such sites, where the parent URL has listed pages inside frames or JavaScript, etc.?

Please let me know.

Thanks
#4
Does anyone have a solution for the JavaScript situation described above?
#5
Thanks for taking the initiative guys! Most useful.

Cheers,
Jonathan
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#6
stacked Wrote: Thanks for the info. Live HTTP Headers, Firebug, and Wireshark are also pretty useful.
Agreed, Wireshark and Firebug make an awesome combo.

The XPath Checker looks useful, thanks for that one!
Quote: or what would be the method to follow for such sites where the parent URL has listed pages inside frames or JavaScript etc.
That information is coming from somewhere. Using Firebug, Wireshark, and attempting to read the JavaScript itself are the best methods to find out where.
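
To make that concrete, here is a minimal sketch of what replaying such a request from Python could look like. It rests on an assumption that has not been checked against the site in question: that the paging is done with an ASP.NET style form postback, where the page link calls __doPostBack() and the browser re-submits the hidden form fields. The sample markup, the control name ctlPager$lnkPage3 and the helper names are all hypothetical. If Firebug or Wireshark shows a different kind of request, that request is the one to copy instead.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback(page_html, event_target, event_argument=""):
    """Build the POST body a browser would send for a paging link."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    # These two fields tell the server which link was "clicked".
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")


# Hypothetical first-page source with a paging link for page 3.
SAMPLE = '''
<form method="post" action="landing.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <a href="javascript:__doPostBack('ctlPager$lnkPage3','')">3</a>
</form>
'''

body = build_postback(SAMPLE, "ctlPager$lnkPage3")
print(body)
```

The resulting bytes would then be sent as the data of a POST request to the same page URL, and the response parsed like any other page.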

If you're interested in writing addons for xbmc, read docs and how-to for plugins and scripts ||| http://code.google.com/p/xbmc-addons/
#7
Fiddler Web Debugger
http://www.fiddler2.com/fiddler2/

Internet Explorer 8 - integrated Developer Tools (F12)
http://www.microsoft.com/windows/interne...fault.aspx

Internet Explorer 6 & 7 - Internet Explorer Developer Toolbar
http://www.microsoft.com/downloadS/detai...laylang=en
#8
Update:
added Fiddler, IE Dev Toolbar and XPather
#9
@queeup
You should try the Fiddler 2.2 beta as well; it has a few improvements.
#10
Can anyone please let me know how to use Python to get the video information shown on each page of a site that uses JavaScript?

link :

http://www.rajshri.com/landing.aspx?lgid=274&catid=257

Using the above tools like Firebug, I am able to get the video links after the JavaScript is executed. But how do I code this in Python? For example, the video reference below is found on page 3 of the above link after the JavaScript is executed. How do I get this information in Python?


Code:
<a href="moviesmidpage.aspx?cntid=9196" id="ctlLandingContent_rptContent_ctl02_ahrefHeader" class="red_txt12" style="padding-left: 5px;">
                                   Hera Pheri</a>
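
For reference, once markup like the above is available as a string in Python, pulling the link and title out of it with re.compile is the easy part. The sketch below does only that and does not execute any JavaScript, so it assumes the page source has already been fetched. The pattern and the function name are illustrative and written against this one snippet.

```python
import re

# The anchor from the post, as it appears in the generated source.
SNIPPET = '''<a href="moviesmidpage.aspx?cntid=9196" id="ctlLandingContent_rptContent_ctl02_ahrefHeader" class="red_txt12" style="padding-left: 5px;">
                                   Hera Pheri</a>'''

# Captures the relative URL, the numeric content id and the link text.
LINK = re.compile(
    r'<a href="(moviesmidpage\.aspx\?cntid=(\d+))"[^>]*>\s*(.*?)\s*</a>',
    re.DOTALL,
)


def extract_links(html):
    """Return (title, href, content_id) tuples for every matching anchor."""
    return [
        (title, href, int(cntid))
        for href, cntid, title in LINK.findall(html)
    ]


links = extract_links(SNIPPET)
print(links)
```

The surrounding whitespace and line break inside the anchor are dropped by the \s* parts of the pattern, so only the title text itself is captured.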

Please let me know .

Thanks
#11
Sorry @sansat, but this is not the Help and Support forum. Please try the Plugin/Script (Python) Help and Support forum with a new thread.