Web Scraper and Web Macros FAQs

This method of building a package uses a template that can be found here:




Navigation:

The idea is to test navigation to each type of page on the site before worrying about extracting the data, because if you can't get to it, you can't extract it.
  • Insert the URL or POST that navigates to the first page you need to extract into the template under Package->Steps tab->Listings 1st Page->File/Form List. This URL typically points to a top-level page, such as the first page of listings or a top-level category page. Try to get as close as you can to the details you need to extract. If you can reach the detail pages directly by figuring out a pattern in the URLs, do it.
  • Run the package and double click the URL in the window that pops up. Make sure the downloaded file has the data you need. If not, try creating a step before this one that navigates to the home page or a blank search page so that a cookie for the site can be obtained. If that doesn't work, you may need to change some HTTP headers in Package->Steps tab->YourStepName->Advanced tab->Http Client. Use an HTTP sniffer to find out what these should be. Cookie, then Referral URL (the Referer header), then User Agent are the most important, in that order.
  • Repeat this process for sample URLs of the other types of pages you need to navigate down to, until you reach the detail pages with the information you need.
  • Once you can navigate to all these pages, you need to make the extraction templates, known as datapages, to get the data into a database, Excel, Access or other tabular format.
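
For readers who want to check a URL pattern or a set of sniffed headers outside Web Scraper, the navigation steps above can be sketched in Python. This is only an illustration of the idea, not how Web Scraper works internally. The site www.example.com, the URL pattern, the function names and all header values are hypothetical placeholders; substitute whatever your HTTP sniffer shows the browser sending.

```python
from urllib.request import Request

# Hypothetical URL pattern for detail pages (assumption, for illustration only).
DETAIL_URL_PATTERN = "http://www.example.com/listing?id={listing_id}"


def detail_urls(listing_ids):
    """Build detail-page URLs directly from a known URL pattern."""
    return [DETAIL_URL_PATTERN.format(listing_id=i) for i in listing_ids]


def build_request(url, cookie=None, referer=None, user_agent=None):
    """Create a request carrying the headers that matter most:
    Cookie, then Referer, then User-Agent."""
    request = Request(url)
    if cookie:
        request.add_header("Cookie", cookie)
    if referer:
        request.add_header("Referer", referer)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    # Placeholder header values; copy the real ones from the sniffer output.
    for url in detail_urls([101, 102]):
        req = build_request(
            url,
            cookie="session=abc123",
            referer="http://www.example.com/search",
            user_agent="Mozilla/5.0",
        )
        # urllib.request.urlopen(req) would actually download the page;
        # here we only show what would be sent.
        print(req.full_url, dict(req.header_items()))
```

If the page downloaded this way matches what the browser shows, the same header values should be a reasonable starting point for the Http Client settings of the step.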


Extraction:

  • In Web Scraper, click the datapages icon on the left and open up the existing Counter Datapage.
  • Go back to Web Scraper and click the packages icon on the left and right click->run the package created above.
  • Click the highest URL that you need information from. (Usually the top one, unless you made a cookie step.)
  • Copy the URL from the browser that opens up into the datapage editor and hit enter.
  • In the top menu, choose datapage->New. (If the file takes a long time to load, you may need to edit the HTML to take out JavaScript or frames that are making things hang. <script tags are replaced with <scpt in the package, but <a href="javascript:doSomethingThatMakesThingsHang()..." may also exist.)
  • Name your datapage and ignore the database tab. NEVER, EVER place anything in the field on that tab.
  • Create a dataset
    • From the drop down select Dataset -> Add
    • Select the number of rows of data that appear on this page from the drop down (Note: the datapage will still work later if there is a different number of rows on a different page.)
    • Name the dataset, and this time enter the database information so that the datapage knows where to record the information it scrapes. Edit the meta tags as you want. I prefer that META_SRC_URI and META_PKG_STRT be left turned on.
  • Create fields
    • Now highlight a little before and after the first field you want to extract in the browser window. If you have issues with overlapping tags, refer here.
    • In the source code window below, highlight exactly what you want to be scraped
    • From the drop down select Field -> Add
    • Name the field
    • Repeat for all fields for this dataset
  • Try a test extraction, by hitting the play button
  • If you aren't getting any rows, go into your dataset properties and click the Edit(Advanced) link at the bottom
    • Change the initialization string to a string that occurs before every row on the page. (If you're having trouble finding the correct HTML to use, try doing a test extraction and clicking the Step-By-Step Replay tab at the bottom. Then look at the HTML in that window to form your initialization strings and start/end tags. Usually this is the same as the HTML that shows up in the bottom pane when you select elements on the page. However, there are cases when that does not work and using the Step-By-Step Replay tab does.)
  • If you get blank rows or missing fields, go into the field properties and click Tags(Manual), then check Enable Manual Override and change the tags.
  • You can go back and add additional datasets if there is data on this page that you want written to different tables altogether.
  • Now add the datapage to your package: double click the package, go to the Steps tab, double click the step that downloads the files, and click the Datapage tab. Then click the Select Datapage button and find your datapage. (You can sort by Create Date to find your most recent datapage more quickly.)
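
To make the roles of the initialization string and the start/end tags more concrete, here is a minimal Python sketch of the same idea. It is not Web Scraper's actual implementation: the function names, the sample HTML and the tag strings are all invented for illustration. The assumption is that the initialization string marks the start of every row and that each field sits between its own start and end tag.

```python
import re


def neutralize_scripts(html):
    """Replace '<script' with '<scpt' so embedded scripts are disabled."""
    return re.sub(r"<script", "<scpt", html, flags=re.IGNORECASE)


def between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end].strip()


def extract_rows(html, init_string, fields):
    """Split html on init_string, then pull each field out of every row chunk.

    fields maps a field name to its (start_tag, end_tag) pair.
    """
    chunks = html.split(init_string)[1:]  # text before the first row is not a row
    return [
        {name: between(chunk, start, end) for name, (start, end) in fields.items()}
        for chunk in chunks
    ]


# Invented sample page with two rows (assumption, for illustration only).
SAMPLE = (
    "<table>"
    '<tr class="row"><td class="name">Widget</td><td class="price">9.99</td></tr>'
    '<tr class="row"><td class="name">Gadget</td><td class="price">19.50</td></tr>'
    "</table>"
)
FIELDS = {
    "name": ('<td class="name">', "</td>"),
    "price": ('<td class="price">', "</td>"),
}

if __name__ == "__main__":
    for row in extract_rows(SAMPLE, '<tr class="row">', FIELDS):
        print(row)
```

In this sketch, an initialization string that never occurs on the page yields no rows, and a wrong start or end tag yields an empty (None) field. These correspond to the two symptoms described above: no rows points to the initialization string, while blank or missing fields point to the field tags.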