2009-09-04, 08:31
Scraping
When writing plugins, part of your job is getting content from a website. If you're lucky, everything you need is neatly presented in one or more RSS feeds. If not, you'll have to extract the information from the site's web pages yourself, which means diving into the HTML source of each page. Fortunately, there are some handy tools and add-ons available to help you find the data you want.
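For the lucky case where the site offers an RSS feed, here is a minimal sketch of reading it with Python's standard library. The feed content, the example.com URLs and the `parse_feed` helper are made up for illustration; a real plugin would download the feed from the site and may well use its framework's own helpers instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real plugin would download this from the site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Videos</title>
    <item>
      <title>Episode 1</title>
      <link>http://example.com/videos/1</link>
    </item>
    <item>
      <title>Episode 2</title>
      <link>http://example.com/videos/2</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("./channel/item"):
        items.append((item.findtext("title"), item.findtext("link")))
    return items


for title, link in parse_feed(SAMPLE_RSS):
    print(title, link)
```

If the feed is missing the fields you need (a thumbnail or a direct video URL, for instance), you are back to scraping the pages themselves.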
Firefox Add-ons
XPath Checker
To gather data from a webpage in your Python code, you will use a lot of XPath expressions. With this add-on you can easily try out your expressions and check that they select the right data before putting them into your Python code.
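Once an expression checks out in the browser, using it from Python could look roughly like the sketch below. It relies on the standard library's `xml.etree.ElementTree`, which only understands a limited subset of XPath and only parses well-formed markup, so treat it as an illustration. The sample page, the `video` class name and the `find_videos` helper are assumptions; real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this example.
SAMPLE_PAGE = """<html>
  <body>
    <div class="video">
      <a href="/watch/1">First clip</a>
      <img src="/thumbs/1.jpg" />
    </div>
    <div class="video">
      <a href="/watch/2">Second clip</a>
      <img src="/thumbs/2.jpg" />
    </div>
    <div class="advert">
      <a href="/buy">Buy now</a>
    </div>
  </body>
</html>"""


def find_videos(page_source):
    """Collect title, link and thumbnail for every <div class="video">."""
    root = ET.fromstring(page_source)
    videos = []
    # The expression below is the kind of thing you would test in the add-on first.
    for div in root.findall(".//div[@class='video']"):
        link = div.find("a")
        thumb = div.find("img")
        videos.append({
            "title": link.text,
            "url": link.get("href"),
            "thumb": thumb.get("src"),
        })
    return videos


for video in find_videos(SAMPLE_PAGE):
    print(video["title"], video["url"], video["thumb"])
```

Note that the advert block is skipped because the predicate only matches elements whose `class` attribute is exactly `video`.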
XPather
Requires: DOM Inspector add-on
...generates XPaths while browsing or inspecting HTML/XML/*ML documents; evaluates your XPaths and inspects the results; extracts the content.
XPather is a simple Firefox extension that integrates with both the browser and its DOM Inspector. Thus, it's very lightweight and cross-platform. It is valuable mainly as a web/XML-app development and hacking tool.
HttpFox
HttpFox logs all HTTP connections and gives a nice overview of which files are being downloaded. This can help you find the paths or URLs of images and videos.
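The paths you spot in a page's source are often relative, while the request log shows full URLs. A small sketch of turning one into the other with the standard library follows; the page URL, the media paths and the `absolute_url` helper are invented for the example.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the paths were found on.
PAGE_URL = "http://example.com/shows/latest/index.html"


def absolute_url(page_url, path):
    """Resolve a path found in a page against that page's own URL."""
    return urljoin(page_url, path)


# Root-relative path: resolved against the site root.
print(absolute_url(PAGE_URL, "/media/clip1.flv"))
# Relative path: resolved against the page's directory.
print(absolute_url(PAGE_URL, "thumbs/clip1.jpg"))
# Already absolute: returned unchanged.
print(absolute_url(PAGE_URL, "http://cdn.example.com/clip1.flv"))
```

Comparing the resolved URL with what the logger recorded is a quick way to confirm you picked the right element.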
Web Developer Toolbar
This toolbar offers a *lot* of options. A couple of nice ones that might come in handy are:
- Outline > Outline Current Element: draws a rectangle around the item under the mouse pointer and displays the path to the element in the document
- Information > Display Element Information: gives you information about the element itself, but also about its parent and child elements
- View Source > View Generated Source: Can't find an element you're looking for? Chances are the element was added to the webpage dynamically (for instance a Flash player that was inserted on page load with the popular SWFObject JavaScript library). With this option you can view the source after any scripts have changed it.
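When an element only shows up in the generated source, the data that feeds it is often still sitting in a script block of the original page. One way to get at it is a regular expression, sketched below. The sample page, the `file` variable name and the `find_video_url` helper are assumptions for illustration; every site sets up its player differently, so the pattern has to be adapted.

```python
import re

# Hypothetical page source: the player is injected by a script on page load,
# so the video URL never appears in a regular HTML element.
SAMPLE_SOURCE = """<html><head>
<script type="text/javascript">
  var so = new SWFObject("/player.swf", "player", "640", "360", "9");
  so.addVariable("file", "http://example.com/media/clip1.flv");
  so.write("player-container");
</script>
</head><body><div id="player-container"></div></body></html>"""

# Captures the second argument of addVariable("file", "...").
FILE_VARIABLE = re.compile(r'addVariable\(\s*"file"\s*,\s*"([^"]+)"\s*\)')


def find_video_url(page_source):
    """Return the URL passed to the player, or None if the pattern is absent."""
    match = FILE_VARIABLE.search(page_source)
    return match.group(1) if match else None


print(find_video_url(SAMPLE_SOURCE))
```

Returning `None` when nothing matches lets the calling code skip the item instead of crashing when the site changes its markup.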
Firebug & Firebug Extensions
Firebug integrates with Firefox to put a wealth of development tools at your fingertips while you browse. You can edit, debug, and monitor CSS, HTML, and JavaScript live in any web page. Firebug also has some nice extensions.
Fiddler Web Debugger
Fiddler is a Web Debugging Proxy which logs all HTTP(S) traffic between your computer and the Internet. Fiddler allows you to inspect all HTTP(S) traffic, set breakpoints, and "fiddle" with incoming or outgoing data. Fiddler includes a powerful event-based scripting subsystem, and can be extended using any .NET language.
Fiddler is freeware and can debug traffic from virtually any application, including Internet Explorer, Mozilla Firefox, Opera, and thousands more.
Internet Explorer Developer Toolbar
Built into IE 8. For IE 6 & 7 it is available as a separate download.
Developers have all of the tools that they need, right out of the box without the need to download or install anything. By simply hitting the F12 key to bring them up, developers have access to:
- DOM manipulation & HTML tree view
- CSS tracing
- JavaScript debugging
- JavaScript profiling
Source: Plex Wiki