Differences

This shows you the differences between two versions of the page.

pluginto:webpage_scraping [2021/10/23 04:19] – created gregbalcopluginto:webpage_scraping [2022/06/04 03:52] (current) gregbalco
Line 5: Line 5:
 For example, instead of having your program query the database for raw data and then assemble it into calculator input format, you can often have the web server do it.
  
-Let's say you want the calculator input for [[http://antarctica.ice-d.org/sample/10-MPS-006-COU|a sample called 10-MPS-006-COU]] (which is notable because 12 different nuclide concentration measurements have been made on it). The webpage generated for that sample includes the formatted calculator input. If you are looking at the page in a browser, you can just copy and paste it whereever you want. For convenience, however, that block of text is delimited in the HTML code by hidden tags like:
+Let's say you want the calculator input for [[http://version2.ice-d.org/antarctica/sample/10-MPS-006-COU|a sample called 10-MPS-006-COU]] (which is notable because 12 different nuclide concentration measurements have been made on it). The webpage generated for that sample includes the formatted calculator input. If you are looking at the page in a browser, you can just copy and paste it wherever you want. For convenience when you want a computer program to do it, however, that block of text is delimited in the HTML code by hidden tags like:
  
-''<!-- begin v3 --><pre>....<!-- end v3 -->''
+<code>
+<!-- begin v3 --><pre>....</pre><!-- end v3 -->
+</code>
  
 So your script can just look for those in the HTML and pull out what is between them. In MATLAB, for example,
  
 +<code>
  
-''urls ['http://antarctica.ice-d.org/sample/10-MPS-006-COU' site_name];
+% Read a webpage into a string
-s = webread(urls);
+s = webread('http://version2.ice-d.org/antarctica/sample/10-MPS-006-COU');
  
 +% Extract the formatted input data
 l1 = '<!-- begin v3 --><pre>';
 l2 = '</pre><!-- end v3 -->';
-v3_input_string = s((strfind(s,l1)+length(l1):strfind(s,l2)-1));''
+v3_input_string = s((strfind(s,l1)+length(l1):strfind(s,l2)-1));
 + 
 +</code> 
 + 
 +That should produce this result: 
 + 
 +<code> 
 +v3_input_string = 
 + 
 +    '10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0; 
 +     10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD; 
 +     10-MPS-006-COU He-3 quartz 7.187e+06 4.083e+05 CRONUS-P 5.190e+09; 
 +     10-MPS-006-COU He-3 quartz 5.462e+06 4.312e+05 CRONUS-P 5.190e+09; 
 +     10-MPS-006-COU He-3 quartz 6.843e+06 5.013e+05 CRONUS-P 5.190e+09; 
 +     10-MPS-006-COU He-3 quartz 6.701e+06 5.219e+05 CRONUS-P 5.190e+09; 
 +     10-MPS-006-COU He-3 quartz 6.250e+06 2.900e+05 CRONUS-P 4.800e+09; 
 +     10-MPS-006-COU Ne-21 quartz 2.546e+07 2.846e+06 CRONUS-A 3.330e+08; 
 +     10-MPS-006-COU Ne-21 quartz 2.543e+07 2.802e+06 CRONUS-A 3.330e+08; 
 +     10-MPS-006-COU C-14 quartz 5.165e+05 1.198e+04; 
 +     10-MPS-006-COU C-14 quartz 3.239e+05 4.550e+03; 
 +     10-MPS-006-COU C-14 quartz 2.624e+05 3.509e+03; 
 +     10-MPS-006-COU C-14 quartz 2.364e+05 3.957e+03; 
 +     ' 
 +</code> 
 + 
 +which you can then use to calculate exposure ages, or whatever. 
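As a minimal sketch of one possible next step, assuming the block holds one semicolon-terminated record per line as in the result above, the string can be split into records and whitespace-delimited fields. The names `records`, `fields`, `nuclide`, and `concentration` are illustrative only, and the two records are copied from the result above so the snippet runs without a network connection.

```matlab
% Two records copied from the result above, separated by newlines (char(10)).
v3_input_string = ['10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0;' char(10) ...
                   '10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;' char(10)];

% Split on semicolons, trim surrounding whitespace, and drop empty pieces.
records = strtrim(strsplit(v3_input_string, ';'));
records = records(~cellfun(@isempty, records));

% Split the second record into fields (default delimiter is whitespace).
fields = strsplit(records{2});
nuclide = fields{2};                     % nuclide name as text
concentration = str2double(fields{4});   % concentration as a number
```

Here the first record is the sample line and each later record is one measurement; what the individual fields mean is defined by the calculator input format, not by this sketch.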
 + 
 +Likewise, a webpage that contains Cl-36 data will have the text input data as a separate formatted block with tags that look like: 
 + 
 +<code> 
 +<!-- begin Cl36 --><pre> ... </pre><!-- end Cl36 --> 
 +</code> 
 + 
 + 
 +which you can extract from the HTML string similarly.  
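The same extraction can be sketched for either block type by building the delimiters from the label, with a guard for pages that do not contain the block. This is an illustration under stated assumptions, not the site's own code: the string `s` below is a hypothetical stand-in for what `webread` would return, its record text is placeholder text rather than real Cl-36 data, and the names `label` and `input_string` are made up for the example.

```matlab
% Hypothetical stand-in for the string returned by webread(); the record
% text is a placeholder, not real Cl-36 data.
s = ['<html><body><!-- begin Cl36 --><pre>' ...
     'PLACEHOLDER-SAMPLE record one;' char(10) ...
     'PLACEHOLDER-SAMPLE record two;' ...
     '</pre><!-- end Cl36 --></body></html>'];

% Build the delimiters from the block label ('v3' or 'Cl36').
label = 'Cl36';
l1 = ['<!-- begin ' label ' --><pre>'];
l2 = ['</pre><!-- end ' label ' -->'];

% strfind returns every match position, or [] if there is none.
i1 = strfind(s, l1);
i2 = strfind(s, l2);

if isempty(i1) || isempty(i2)
    % The page has no block of this type.
    input_string = '';
else
    % Take the text between the first opening and first closing delimiter.
    input_string = s(i1(1)+length(l1) : i2(1)-1);
end
```

Checking for an empty match before indexing avoids an error on pages that lack the requested block, for example a sample page with no Cl-36 measurements.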
 + 
 +