Differences
This shows you the differences between two versions of the page.
| pluginto:webpage_scraping [2021/10/23 04:21] – gregbalco | pluginto:webpage_scraping [2022/06/04 03:52] (current) – gregbalco | ||
|---|---|---|---|
| Line 5: | Line 5: | ||
| For example, instead of having your program query the database for raw data and then assemble it into calculator input format, you can often have the web server do it. | For example, instead of having your program query the database for raw data and then assemble it into calculator input format, you can often have the web server do it. | ||
| - | Let's say you want the calculator input for [[http://antarctica.ice-d.org/ | + | Let's say you want the calculator input for [[http://version2.ice-d.org/antarctica/ |
| - | <nowiki> | + | <code> |
| - | <!-- begin v3 -->< | + | <!-- begin v3 -->< |
| - | </nowiki> | + | </code> |
| So your script can just look for those in the HTML and pull out what is between them. In MATLAB, for example, | So your script can just look for those in the HTML and pull out what is between them. In MATLAB, for example, | ||
| - | <nowiki> | + | <code> |
| - | urls = [' | + | |
| - | s = webread(urls); | + | |
| + | % Read a webpage into a string | ||
| + | s = webread(' | ||
| + | |||
| + | % Extract the formatted input data | ||
| l1 = '< | l1 = '< | ||
| l2 = '</ | l2 = '</ | ||
| v3_input_string = s((strfind(s, | v3_input_string = s((strfind(s, | ||
| - | </nowiki> | + | |
| + | </code> | ||
| + | |||
| + | That should produce this result: | ||
| + | |||
| + | < | ||
| + | v3_input_string = | ||
| + | |||
| + | ' | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | ' | ||
| + | </ | ||
| + | |||
| + | which you can then use to calculate exposure ages or for any other purpose. | ||
| + | |||
| + | Likewise, a webpage that contains Cl-36 data will have the text input data as a separate formatted block with tags that look like: | ||
| + | |||
| + | < | ||
| + | <!-- begin Cl36 -->< | ||
| + | </ | ||
| + | |||
| + | |||
| + | which you can extract from the HTML string similarly. | ||
| + | |||
| + | |||
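
The extraction step described in the revised text can be sketched in MATLAB as follows. This is a minimal illustration that runs on a made-up HTML string rather than a live page. The closing marker `<!-- end v3 -->`, the sample lines, and the variable names are assumptions for illustration only; the real tags should be read from the source of the page you are scraping.

```matlab
% Sketch: pull out the text between two marker strings in an HTML string.
% ASSUMPTIONS: the closing marker and the sample page are hypothetical;
% only the '<!-- begin v3 -->' opening comment is taken from the text above.
tagName  = 'v3';                               % use 'Cl36' for Cl-36 pages
startTag = ['<!-- begin ' tagName ' -->'];     % opening marker
endTag   = ['<!-- end ' tagName ' -->'];       % hypothetical closing marker

% Stand-in for the string that webread() would return for a real page
html = sprintf('<html><body>\n%s\nSAMPLE_LINE_1\nSAMPLE_LINE_2\n%s\n</body></html>', ...
    startTag, endTag);

% Locate the markers
i1 = strfind(html, startTag);
i2 = strfind(html, endTag);
if isempty(i1)
    error('Start marker not found; the page layout may differ from this sketch.');
end

% First character after the first start marker
first = i1(1) + length(startTag);

% Keep only end markers that come after the start marker
i2 = i2(i2 >= first);
if isempty(i2)
    error('End marker not found after the start marker.');
end

% Text between the markers, with surrounding whitespace removed
block = strtrim(html(first:(i2(1) - 1)));
disp(block)
```

Checking that both markers were found before indexing avoids a confusing indexing error when a page has no formatted block of the requested type. For a page with Cl-36 data, setting `tagName = 'Cl36'` would select the corresponding block, assuming its markers follow the same pattern.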