Using WebScraper
- Open a wire sheet view.
- Drag the WebScraper and the parsers you need from the palette onto the wire sheet.
Any number of parsers, regardless of type, can be used with a single WebScraper.
- Connect the 'Out' slot of the WebScraper to the 'html In' slot of each parser.
- Double-click on WebScraper to access its property sheet. Enter the desired URL.
Unlike in a web browser, WebScraper URLs must include the protocol (e.g. http://).
- If the website requires authentication, set "Basic Authentication" to true and enter your username and password in the spaces provided.
- Right-click on your WebScraper (either on the top line of the property sheet or on the wire sheet view) and select Action > Get HTTP. WebScraper can also refresh the source code automatically at a fixed time interval. To set this up, expand the Interval property in your WebScraper's property sheet.
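For readers who want to see what the fetch step amounts to outside the wire sheet, here is a minimal Python sketch. It is not WebScraper's implementation. The function names build_request and get_http are hypothetical, and accepting https as well as http is an assumption. It shows the two points made above: the URL must carry its protocol, and Basic Authentication sends the username and password in a request header.

```python
import base64
import urllib.request
from urllib.parse import urlparse


def build_request(url, username=None, password=None):
    """Build an HTTP request, optionally with a Basic Authentication header."""
    # Mirror the rule above: the URL must include its protocol (assumed http or https).
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("URL must include the protocol, e.g. http://")
    request = urllib.request.Request(url)
    if username is not None:
        # Basic Authentication sends base64("username:password") in a header.
        credentials = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


def get_http(url, username=None, password=None, timeout=10):
    """Fetch the page source as text (a rough analogue of the Get HTTP action)."""
    request = build_request(url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling build_request with a URL such as www.wunderground.com/US/NC/Charlotte.html (no protocol) raises ValueError, which is the same mistake the note above warns about.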
Using RegEx and XPath
In the Property view of the RegEx or XPath object:
- Set isNumeric to true if you intend to use numeric data or false otherwise.
- Enter your regular expression or XPath in the "Expression" field.
Regular Expressions tutorial: http://download.oracle.com/javase/tutorial/essential/regex/
XPath tutorial: http://www.w3schools.com/xpath/default.asp
- Invoke the "Parse" action.
By default, both RegEx and XPath parse automatically whenever the html In property changes.
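To illustrate what the Expression and isNumeric settings do, the following Python sketch applies a regular expression to a page source and optionally converts the match to a number. It is an illustration only. The function parse_regex, the sample HTML, and the choice to return the first capture group are assumptions rather than documented RegEx parser behaviour, and Python's regular expression syntax differs in some details from the Java syntax covered in the tutorial linked above.

```python
import re


def parse_regex(html, expression, is_numeric=False):
    """Return the first match of expression in html, or None if nothing matches."""
    match = re.search(expression, html)
    if match is None:
        return None
    # Assumption: use the first capture group if there is one, else the whole match.
    text = match.group(1) if match.groups() else match.group(0)
    # is_numeric plays the role of the isNumeric property: convert text to a number.
    return float(text) if is_numeric else text


# Hypothetical page source, not taken from any real site.
sample = '<span id="tempActual">72.5</span> <span id="currCond">Overcast</span>'

temperature = parse_regex(sample, r'id="tempActual">([-\d.]+)<', is_numeric=True)
condition = parse_regex(sample, r'id="currCond">([^<]+)<')
```

Here temperature is a float and condition is a string, which is the distinction the isNumeric property controls.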
Examples
Example 1: Current temperature in Charlotte, NC from http://weatherunderground.com
Create a new folder. Add a WebScraper and an XPath parser to your wire sheet.
- Connect WebScraper.Out to XPath.html In.
- Open the property sheet of the WebScraper and enter the URL (http://www.wunderground.com/US/NC/Charlotte.html) in the URL field. Right-click WebScraper and select "Get Http."
- Open the property sheet of the XPath parser. Identify the XPath for the temperature on the website and enter it into the expression field of the parser. In this case, the XPath we can use is //*[@id="tempActual"]
Remember to check isNumeric!
- Right-click on the XPath parser and select Action > Parse. Your result should appear in a new slot.
Example 2: Current condition (clear, overcast, etc.) from WU.com
- On the same wire sheet as in the previous example, add a new XPath parser.
- Connect WebScraper.Out to the new XPath's html In.
- Identify the XPath for the current condition and enter it into the expression field of the parser: //*[@id="currCond"]
- Invoke the "Parse" action.
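The two XPath expressions from these examples can be tried outside the wire sheet with a short Python sketch. The page content below is a hypothetical, well-formed stand-in, not the real Weather Underground source, and parse_xpath is an illustrative helper, not the XPath parser's implementation. Python's standard xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so a real page would likely need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page source; the real page differs.
page = """<html><body>
  <span id="tempActual">72.5</span>
  <div id="currCond">Overcast</div>
</body></html>"""

root = ET.fromstring(page)


def parse_xpath(root, expression, is_numeric=False):
    """Evaluate a simple XPath and return the first matching element's text."""
    # ElementTree only accepts relative paths, so //*[...] is run as .//*[...]
    element = root.find("." + expression)
    if element is None or element.text is None:
        return None
    text = element.text.strip()
    # is_numeric plays the role of the isNumeric property.
    return float(text) if is_numeric else text


# Example 1: temperature, parsed as a number.
temperature = parse_xpath(root, '//*[@id="tempActual"]', is_numeric=True)
# Example 2: current condition, kept as text.
condition = parse_xpath(root, '//*[@id="currCond"]')
```

Both parsers read the same page source, just as both XPath objects on the wire sheet are fed from the same WebScraper.Out slot.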