Scraping data from ICE-D webpages

This outlines some tricks that can be used to make the web server do some of the work for you.

For example, instead of having your program query the database for raw data and then assemble it into calculator input format, you can often have the web server do it.

Let's say you want the calculator input for a sample called 10-MPS-006-COU (which is notable because 12 different nuclide concentration measurements have been made on it). The webpage generated for that sample includes the formatted calculator input. If you are looking at the page in a browser, you can just copy and paste it wherever you want. For convenience, however, that block of text is also delimited in the HTML source by hidden tags (HTML comments) like:

<!-- begin v3 --><pre>....</pre><!-- end v3 -->

So your script can just look for those in the HTML and pull out what is between them. In MATLAB, for example,

% Read a webpage into a string
s = webread('http://antarctica.ice-d.org/sample/10-MPS-006-COU');

% Extract the formatted input data
l1 = '<!-- begin v3 --><pre>';
l2 = '</pre><!-- end v3 -->';
v3_input_string = s(strfind(s,l1)+length(l1):strfind(s,l2)-1);

That should produce this result:

v3_input_string =

    '10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0;
     10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;
     10-MPS-006-COU He-3 quartz 7.187e+06 4.083e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 5.462e+06 4.312e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.843e+06 5.013e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.701e+06 5.219e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.250e+06 2.900e+05 CRONUS-P 4.800e+09;
     10-MPS-006-COU Ne-21 quartz 2.546e+07 2.846e+06 CRONUS-A 3.330e+08;
     10-MPS-006-COU Ne-21 quartz 2.543e+07 2.802e+06 CRONUS-A 3.330e+08;
     10-MPS-006-COU C-14 quartz 5.165e+05 1.198e+04;
     10-MPS-006-COU C-14 quartz 3.239e+05 4.550e+03;
     10-MPS-006-COU C-14 quartz 2.624e+05 3.509e+03;
     10-MPS-006-COU C-14 quartz 2.364e+05 3.957e+03;
     '

which you can then use to calculate exposure ages, or whatever.
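Before passing the string on, it can be handy to break it into individual lines. The following is a minimal sketch, assuming (from the output above) that the first line describes the sample, that each later line is one nuclide measurement, and that the nuclide name is the second whitespace-separated field. The variable names input_lines, sample_line, nuclide_lines and nuclides are made up for this example, and the string here is a shortened copy of the result above rather than a fresh download.

```matlab
% Shortened copy of the output shown above (first three lines only)
v3_input_string = sprintf('%s\n', ...
    '10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0;', ...
    '10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;', ...
    '10-MPS-006-COU He-3 quartz 7.187e+06 4.083e+05 CRONUS-P 5.190e+09;');

% Split on newlines and trim stray whitespace (including any '\r')
input_lines = strtrim(strsplit(strtrim(v3_input_string), sprintf('\n')));

% Assumption: line 1 is the sample line, the rest are nuclide lines
sample_line = input_lines{1};
nuclide_lines = input_lines(2:end);

% Assumption: the nuclide name is the second field on each nuclide line
nuclides = cell(size(nuclide_lines));
for k = 1:numel(nuclide_lines)
    fields = strsplit(nuclide_lines{k});   % splits on whitespace by default
    nuclides{k} = fields{2};
end

disp(nuclides)
```

With the full string for this sample, nuclides would let you count how many measurements of each nuclide there are, for instance with unique.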

A webpage that contains Cl-36 data will have that as a separate formatted block with tags that look like

<!-- begin Cl36 --><pre></pre><!-- end Cl36 -->
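The same extraction should work for the Cl-36 block, with the added wrinkle that many pages will not have one. Here is a minimal sketch that checks for that case. The string s below is a hypothetical stand-in for what webread would return, and the text inside the pre block is a placeholder, not real Cl-36 calculator input.

```matlab
% Hypothetical stand-in for a page returned by webread; the text inside
% the <pre> block is a placeholder, not real Cl-36 input.
s = ['<html><body><!-- begin Cl36 --><pre>PLACEHOLDER Cl-36 INPUT' ...
     '</pre><!-- end Cl36 --></body></html>'];

% Delimiters for the Cl-36 block
l1 = '<!-- begin Cl36 --><pre>';
l2 = '</pre><!-- end Cl36 -->';

% Locate the delimiters; strfind returns [] if a delimiter is absent
i1 = strfind(s,l1);
i2 = strfind(s,l2);

if isempty(i1) || isempty(i2)
    cl36_input_string = '';   % no Cl-36 block on this page
else
    % Use the first occurrence of each delimiter
    cl36_input_string = s(i1(1)+length(l1):i2(1)-1);
end

disp(cl36_input_string)
```

Returning an empty string when the block is missing lets a script loop over many samples without stopping at the ones that have no Cl-36 data.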

In a webpage that contains exposure age results, the XML returned by the exposure age calculator is also included, inside a hidden HTML comment:

<!-- begin_xml_dump <XML GOES HERE> end_xml_dump -->

So you can extract it from the HTML string using a similar approach, searching for begin_xml_dump and end_xml_dump, and then parse it or save it for later.
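As a rough sketch of what that could look like, the code below pulls out whatever sits between the two markers and trims the surrounding whitespace. The string s is a hypothetical stand-in for a downloaded page, and the XML element names in it are placeholders; the actual structure of the calculator's XML is not shown here.

```matlab
% Hypothetical stand-in for a page returned by webread. The XML element
% names are placeholders, not the calculator's real output.
s = ['<html><body><!-- begin_xml_dump ' ...
     '<placeholder_root><placeholder_field>value</placeholder_field></placeholder_root>' ...
     ' end_xml_dump --></body></html>'];

% Markers around the XML dump
l1 = '<!-- begin_xml_dump';
l2 = 'end_xml_dump -->';

i1 = strfind(s,l1);
i2 = strfind(s,l2);

if isempty(i1) || isempty(i2)
    xml_string = '';          % page has no exposure age results
else
    % Take the text between the markers and strip leading/trailing whitespace
    xml_string = strtrim(s(i1(1)+length(l1):i2(1)-1));
end

disp(xml_string)
```

From there, xml_string could be written to a file and read with an XML parser, or searched directly with regexp for the fields you need.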