Web Scraper and Web Macros FAQs

If data is on different pages of a site, this means you have to use different datapages and therefore separate tables. (in some very rare instances you can write to the same table from several datapages, but this would mean you were scraping the exact same information from different page formats, which is not what would happen in this scenario)

So how do you join the data back together after its scattered to two or more tables? The short answer is by using SQL Joins. Assuming the pages are somehow linked together, we can use the information about those links that we scraped to join the tables together. Imagine the following scenario of three datapages:

Main Page Datapage
SQL Table Name: Main
Fields: f1, f2, f3, Link1, Link2, META_SRC_URI


First Additional Datapage
SQL Table Name: Details1
Fields: f1, f2, META_SRC_URI


Second Additional Datapage
SQL Table Name: Details2
Fields: f1, f2, META_SRC_URI

Link1 and Link2 form the main datapage lead to the two additional pages. The META_SRC_URI field is a META tag that indicates the URL where the data originated from. In the query editor you would do this to join things together:

SELECT Main.f1 , Main.f2 , Main.f3 , Details1.f1 as f4 , Details1.f2 as f5 , Details2.f1 as f6 , Details2.f2 as f7 , Main.META_SRC_URI as MainPageSourceLocation , Details1.META_SRC_URI as Details1SourceLocation , Details2.META_SRC_URI as Details2SourceLocation FROM Main INNER JOIN Details1 ON Main.Link1 = Details1.META_SRC_URI INNER JOIN Details2 ON Main.Link2 = Details2.META_SRC_URI

If you don't have any SQL background then this can look a bit overwhelming. Let me break this down step by step

SELECT Main.f1, Main.f2, ... , Details2.META_SRC_URI as Details2SourceLocation
This "select clause" is specifying every column that we want in our table. If we were just pulling form one table we could get away with just using our field names (f1, f2, f3) but since we're pulling form multiple table we need to specify a table name in front of the column name with a . (Main.f1, Mains.f2, Main.f3)

FROM Main
This specifies which table we are pulling from. You can actually specify more than one table here, but that syntax is a little more confusing. So instead a listed additional tables in their own Join clauses.

INNER JOIN Details1 ON Main.Link1 = Details1.META_SRC_URI INNER JOIN ...
The two Join clases specify how to match up the rows between all three tables. We specify that the first link that we scraped must match the META_SRC_URI of the first Detailed page. Likewise the second link must match up with the url of the second details page.

NOTE:
This is not changing anything in any tables. If you run a query like this in the query editor, you will see a joined table, but no tables are actually modified. Just use the export to excel button with the correct query to save your joined results to excel. Or you can look into SQL's "SELECT INTO" syntax if you want your results in another table.



PoweredBy
Create a Page | Administration | File Management | Login/Logout | Language Selection | Your Profile |Create Account