Scraping data from ICE-D webpages

This page outlines some tricks for getting the web server to do some of the work for you.

For example, instead of having your program query the database for raw data and then assemble it into calculator input format, you can often have the web server do it.

Let's say you want the calculator input for a sample called 10-MPS-006-COU (which is notable because 12 different nuclide concentration measurements have been made on it). The webpage generated for that sample includes the formatted calculator input. If you are looking at the page in a browser, you can just copy and paste it wherever you want. To make the same thing easy for a computer program, that block of text is delimited in the HTML source by HTML comments, which a browser does not display:

<!-- begin v3 --><pre>....</pre><!-- end v3 -->

So your script can just look for those in the HTML and pull out what is between them. In MATLAB, for example,

% Read a webpage into a string
s = webread('http://version2.ice-d.org/antarctica/sample/10-MPS-006-COU');

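% Extract the formatted input data (the text between the two tags)
start_tag = '<!-- begin v3 --><pre>';
end_tag = '</pre><!-- end v3 -->';
first_char = strfind(s, start_tag) + length(start_tag);
last_char = strfind(s, end_tag) - 1;
v3_input_string = s(first_char:last_char);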

That should produce this result:

v3_input_string =

    '10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0;
     10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;
     10-MPS-006-COU He-3 quartz 7.187e+06 4.083e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 5.462e+06 4.312e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.843e+06 5.013e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.701e+06 5.219e+05 CRONUS-P 5.190e+09;
     10-MPS-006-COU He-3 quartz 6.250e+06 2.900e+05 CRONUS-P 4.800e+09;
     10-MPS-006-COU Ne-21 quartz 2.546e+07 2.846e+06 CRONUS-A 3.330e+08;
     10-MPS-006-COU Ne-21 quartz 2.543e+07 2.802e+06 CRONUS-A 3.330e+08;
     10-MPS-006-COU C-14 quartz 5.165e+05 1.198e+04;
     10-MPS-006-COU C-14 quartz 3.239e+05 4.550e+03;
     10-MPS-006-COU C-14 quartz 2.624e+05 3.509e+03;
     10-MPS-006-COU C-14 quartz 2.364e+05 3.957e+03;
     '

which you can then use as calculator input to compute exposure ages, or for anything else.
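
Once extracted, the string is usually easier to work with when split into one entry per input line. The following is a minimal sketch, assuming (as in the result above) that every input line ends with a semicolon, that fields are separated by whitespace, and that the first line is the sample line. The variable names are made up for illustration, and the string is a shortened copy of the result above so that the example runs without a network connection.

```matlab
% Shortened copy of the result shown above (sample line plus three measurements)
v3_input_string = sprintf([ ...
    '10-MPS-006-COU -83.28515 -57.97676 923 ant  4.5 2.60 0.9945 0 0;\n' ...
    '10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;\n' ...
    '10-MPS-006-COU Ne-21 quartz 2.546e+07 2.846e+06 CRONUS-A 3.330e+08;\n' ...
    '10-MPS-006-COU C-14 quartz 5.165e+05 1.198e+04;\n']);

% Split at the semicolons, trim whitespace, and drop the empty leftover piece
entries = strtrim(strsplit(v3_input_string, ';'));
entries = entries(~cellfun(@isempty, entries));

% Break each entry into its whitespace-separated fields
tokens = cellfun(@(e) regexp(e, '\s+', 'split'), entries, 'UniformOutput', false);

% Assumption: entry 1 is the sample line and the rest are measurement lines,
% with the nuclide in field 2 and the concentration in field 4
nuclides = cellfun(@(t) t{2}, tokens(2:end), 'UniformOutput', false);
concentrations = cellfun(@(t) str2double(t{4}), tokens(2:end));

for k = 1:numel(nuclides)
    fprintf('%s: %g\n', nuclides{k}, concentrations(k));
end
```

Run on the full string for 10-MPS-006-COU, the same steps should give 12 measurement entries after the sample line.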

Likewise, a webpage that contains Cl-36 data will have the text input data as a separate formatted block with tags that look like:

<!-- begin Cl36 --><pre> ... </pre><!-- end Cl36 -->

which you can extract from the HTML string similarly.
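
Because both blocks use the same begin/end pattern, one extraction routine can serve for either. The sketch below is one way to do that with a regular expression rather than strfind, which also makes it easy to detect a page that lacks a given block. The helper name extract_block and the stand-in HTML string are made up for illustration (a real script would get the string from webread, as above), and the Cl36 content shown is placeholder text, not real input data. It assumes each label occurs at most once per page and contains no regular-expression special characters.

```matlab
% Hypothetical helper: return the text between the begin/end tags for a label.
% With 'tokens' and 'once', regexp returns a cell holding the captured text,
% or an empty cell if the page has no such block.
extract_block = @(html, label) regexp(html, ...
    ['<!-- begin ' label ' --><pre>(.*?)</pre><!-- end ' label ' -->'], ...
    'tokens', 'once', 'dotall');

% Stand-in for a downloaded page; the Cl36 content is placeholder text
html = ['<html><body><!-- begin v3 --><pre>' ...
        '10-MPS-006-COU Be-10 quartz 3.065e+06 4.168e+04 07KNSTD;' ...
        '</pre><!-- end v3 -->' ...
        '<!-- begin Cl36 --><pre>placeholder Cl-36 input</pre><!-- end Cl36 -->' ...
        '</body></html>'];

% Try both real labels plus one that is absent from the page
labels = {'v3', 'Cl36', 'no_such_block'};
for k = 1:numel(labels)
    block = extract_block(html, labels{k});
    if isempty(block)
        fprintf('%s: not found on this page\n', labels{k});
    else
        fprintf('%s: %s\n', labels{k}, strtrim(block{1}));
    end
end
```

Checking for an empty result before using the text matters because a sample page without, say, Cl-36 data would otherwise silently yield an empty string.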