Web Scraper and Web Macros FAQs

How to Build a Package


Page Revision: 2009/05/13 19:24


This method of building a package uses a template that can be found here:



Navigation:

The idea is to test navigation to each type of page on the site before worrying about extracting the data, because if you can't get to it, you can't extract it.
  • Insert the URL or POST that navigates to the first page you need to extract into the template under Package->Steps tab->Listings 1st Page->File/Form List. This is typically the URL of a top-level page, such as the first page of listings or a top-level category page. Try to get as close as you can to the details you need to extract. If you can reach the detail pages directly by figuring out a pattern in the URLs, do it.
  • Run the package and double-click the URL in the window that pops up. Make sure the downloaded file has the data you need. If not, try creating a step before this one that navigates to the home page or a blank search page so that a cookie for the site can be obtained. If that doesn't work, you may need to change some HTTP headers in Package->Steps tab->YourStepName->Advanced tab->Http Client. Use an HTTP sniffer to find out what these should be. The most important, in order, are Cookie, then the referring URL (the Referer header), then User-Agent.
  • Repeat this process for sample URLs of the other types of pages you need to navigate down to, until you reach the detail pages with the information you need.
  • Once you can navigate to all these pages, you need to create the extraction templates, known as datapages, to get the data into a database, Excel, Access or other tabular format.
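
When a step will not download the right file, it can help to reproduce the request outside Web Scraper first. The following is a minimal Python sketch of the ideas above (a cookie step, the Referer and User-Agent headers, and detail URLs built from a pattern). It is not how Web Scraper itself works; the URLs, the pattern, the User-Agent string and all function names are hypothetical placeholders to be replaced with whatever your HTTP sniffer shows.

```python
import http.cookiejar
import urllib.request

# Hypothetical values for illustration only; take the real ones from an HTTP sniffer.
HOME_URL = "http://example.com/"
DETAIL_URL_PATTERN = "http://example.com/listing?id={listing_id}"
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_opener_with_cookies():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, referer=None, user_agent=USER_AGENT):
    """Build a GET request carrying the headers that sites most often check."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    if referer is not None:
        request.add_header("Referer", referer)
    return request


def detail_urls(listing_ids):
    """Build detail-page URLs directly from a pattern spotted in the site's URLs."""
    return [DETAIL_URL_PATTERN.format(listing_id=i) for i in listing_ids]


def fetch(opener, request):
    """Download one page; cookies set by the site are stored in the opener's jar."""
    with opener.open(request, timeout=30) as response:
        return response.read()
```

A typical sequence would be to call fetch(opener, build_request(HOME_URL)) once as the cookie step, then fetch each URL from detail_urls(...) with referer=HOME_URL. If the page comes back correct here but not in the package, the difference is most likely in the headers configured on the step.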

Extraction:

  • In Web Scraper, click the datapages icon on the left and open up the existing Counter Datapage.
  • Go back to Web Scraper and click the packages icon on the left and right click->run the package created above.
  • Click the highest URL that you need information from. (Usually the top one, unless you made a cookie step.)
  • Copy the URL from the browser that opens, paste it into the datapage editor and press Enter.
  • In the top menu, choose datapage->New. (If the file takes a long time to load, you may need to edit the HTML to take out JavaScript or frames that are making things hang. <script tags are replaced with <scpt in the package, but <a href="javascript:doSomethingThatMakesThingsHang()..." may also exist.)
  • Name your datapage and ignore the database tab. NEVER, EVER place anything in the field on that tab.
  • Create a dataset
    • From the drop down select Dataset -> Add
    • Select the number of rows of data that appear on this page from the drop down (Note: the datapage will still work later if there is a different number of rows on a different page.)
    • Name the dataset, and this time enter the database information so that the datapage knows where to record the information it scrapes. Edit the meta tags as you want. I prefer to leave META_SRC_URI and META_PKG_STRT turned on.
  • Create fields
    • In the browser window, highlight the first field you want to extract, including a little of the content before and after it. If you have issues with overlapping tags, refer here.
    • In the source code window below, highlight exactly what you want to be scraped
    • From the drop down select Field -> Add
    • Name the field
    • Repeat for all fields for this dataset
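
To illustrate the idea behind highlighting "a little before and after" each field, here is a minimal Python sketch that pulls fields out of page source using the text that surrounds them, repeating for every row on the page. This is only an analogy for what a datapage does, not Web Scraper's actual implementation; the sample HTML, the marker strings and the function names are all made up for the example.

```python
def extract_between(source, before, after, start=0):
    """Return (value, next_position) for the text between two markers.

    Returns (None, -1) when either marker is not found after `start`.
    """
    begin = source.find(before, start)
    if begin == -1:
        return None, -1
    begin += len(before)
    end = source.find(after, begin)
    if end == -1:
        return None, -1
    return source[begin:end].strip(), end + len(after)


def extract_rows(source, field_markers):
    """Extract repeated rows; field_markers is an ordered list of (name, before, after)."""
    rows = []
    if not field_markers:
        return rows
    position = 0
    while True:
        row = {}
        for name, before, after in field_markers:
            value, next_position = extract_between(source, before, after, position)
            if value is None:
                # No further complete row on the page; a partial row is dropped.
                return rows
            row[name] = value
            position = next_position
        rows.append(row)


# Hypothetical page source and markers, for illustration only.
SAMPLE_HTML = """
<tr><td class="name">Widget</td><td class="price">$4.00</td></tr>
<tr><td class="name">Gadget</td><td class="price">$7.50</td></tr>
"""
FIELDS = [
    ("name", '<td class="name">', "</td>"),
    ("price", '<td class="price">', "</td>"),
]

rows = extract_rows(SAMPLE_HTML, FIELDS)
print(rows)
```

Because the loop simply continues until the markers stop matching, the same definition works on pages with a different number of rows, which mirrors the note in the dataset step above.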

Note: you can also go back and add additional datasets if there is data on this page that you want written to different tables altogether.
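
If a saved page hangs in the datapage editor because of scripts, as mentioned in the datapage->New step, the file can be cleaned before it is loaded. The sketch below is a rough Python approximation of that cleanup: it renames script tags the way the package is described as doing (<script becomes <scpt) and additionally blanks out javascript: links. Renaming the closing tag and replacing the link target with "#" are assumptions made for this example, not documented Web Scraper behavior.

```python
import re

# Match opening and closing script tags regardless of letter case.
SCRIPT_OPEN = re.compile(r"<script\b", re.IGNORECASE)
SCRIPT_CLOSE = re.compile(r"</script\b", re.IGNORECASE)
# Match href attributes whose quoted value starts with "javascript:".
JS_HREF = re.compile(
    r"""href\s*=\s*(["'])\s*javascript:.*?\1""",
    re.IGNORECASE | re.DOTALL,
)


def neutralize_scripts(html):
    """Rename script tags and replace javascript: link targets with a harmless '#'."""
    html = SCRIPT_OPEN.sub("<scpt", html)
    html = SCRIPT_CLOSE.sub("</scpt", html)
    return JS_HREF.sub('href="#"', html)


sample = '<script>x()</script><a href="javascript:doSomethingThatMakesThingsHang()">go</a>'
print(neutralize_scripts(sample))
```

Regular expressions are good enough for this kind of one-off cleanup of a saved file; they are not a general HTML parser, so check the result by eye before building the datapage on it.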