SQL Queries - suggested structure, cautions, and examples

Find here some suggestions on how to structure your queries, some things to be aware of when making queries, and some examples of useful queries you can directly copy and paste into your SQL program that should work in ICE-D.


suggested query structure from Joe:

In my approach to writing SQL queries, I like to think of the structure of an SQL query in the following steps:

1. SELECT: The actual information from the various fields in ICE-D that you want to get (e.g., sample name, sample latitude/longitude/elevation, site name, sample nuclide concentrations, publication DOI's etc.). Attributes to select should be comma separated.

example:

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD, base_sample.elv_m, base_site.short_name, base_publication.DOI

2. FROM: I typically name the base_sample table as the starting table to select from, unless you aren't specifically looking for sample information (for example you might only be looking for nuclide information from a specific lab to plot - see my plot page for 10Be concentrations from every sample measured at Lamont here).

example:

FROM base_sample

3. JOIN: In the event you want information from multiple tables (e.g., you want information from the base_sample and base_site and base_publication tables), you need to specify how these tables are linked together, in other words which fields in which tables are used to link tables together. In ICE-D, these fields (attributes) are all going to be ID fields. There are other ways to link tables, but brief documentation reading suggests this is the most computationally efficient path. The syntax is: JOIN target table ON target table.identifier = starting table.identifier

example if I want information from example above, I need to connect to the base_site table and the base_publication table. This also means, since we have a two-way relation between samples and publications, there is a third table linking the two called base_samplepublicationsmatch:

JOIN base_site ON base_site.id = base_sample.site_id
JOIN base_samplepublicationsmatch ON base_samplepublicationsmatch.sample_id = base_sample.id
JOIN base_publication ON base_publication.id = base_samplepublicationsmatch.publication_id

4. WHERE: this is the section where you describe your conditionals, in other words the criteria you will use to filter data. Either by geography (Lat-Lon), elevation, by sample and/or site type, etc. The first conditional starts with the WHERE statement and any following conditionals start with the AND statement. You must be explicit each time with the name of the table dot name of the attribute being filtered.

example: I want attribute information from the above examples for samples in Colorado (between 37 and 41 degrees North and 102 and 109 degrees West) all above 3500 meters asl.

WHERE base_sample.lat_DD > 37
AND base_sample.lat_DD < 41
AND base_sample.lon_DD > -109
AND base_sample.lon_DD < -102
AND base_sample.elv_m > 3500

So! If you put this all together, here is the query:

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD, base_sample.elv_m, base_site.short_name, base_publication.DOI
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
JOIN base_samplepublicationsmatch ON base_samplepublicationsmatch.sample_id = base_sample.id
JOIN base_publication ON base_publication.id = base_samplepublicationsmatch.publication_id
WHERE base_sample.lat_DD > 37
AND base_sample.lat_DD < 41
AND base_sample.lon_DD > -109
AND base_sample.lon_DD < -102
AND base_sample.elv_m > 3500



A couple cautions:

1. Beware of replicate samples from queries!
Certain tables are linked to others through a two-way relation, or a 'many-to-many' relation instead of the more typical one-way relation.

For example, samples and publications are linked through a two-way relation because it is possible for a publication to have many samples (it better, right? It would hopefully be tough to publish a paper with only ONE exposure age reported in it), but it is also possible for one sample to be referenced in multiple publications (perhaps a research group ran a sample back in the day, but recently re-ran another aliquot of the same exact sample and discussed that sample in a new publication).

Thus, IF YOU DO NOT WANT TO SELECT INFORMATION FROM A SAMPLE MORE THAN ONCE and it shows up in multiple publications, I think it can be useful to use the DISTINCT statement after SELECT (see examples of DISTINCT in use below).


2. Beware of omitted samples from queries! If you are querying the database for specific fields in various tables, be aware that it is possible for an entry to exist in one table that is not linked to anything in another table of interest and that entry WILL NOT BE SELECTED if you want column info from a table where that entry has no data.

For example, someone might have data in ICE-D that is yet unpublished but fully measured. The samples can have calculated ages, but of course they are not linked to any publication entry in the publication table. Thus, if you want sample info AND publication info for a bunch of samples in an area of interest, but some of the samples in that area do NOT belong to a publication, those 'unpublished' samples WILL BE OMITTED FROM THE RESULTS.

There is probably some work around to fix this, like selecting those samples even if they aren't linked to a publication and letting the pubication fields be NULL or something, but I haven't yet figured that out. Maybe a left or right join? Not sure. For the time being, just be aware this can happen and you might not be getting the representative selection you are hoping for.





Example Queries

1. Select all the sample attribute information within a specific site.
New feature: select all fields in a table using the “*”

SELECT base_sample.*
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
WHERE base_site.id = 5013


2. Select necessary attribute information formatted in the v3 exposure age calculator input for 10Be samples in a specific site NOT in Antarctica (if in Antarctica, just change the 'std' flag to the 'ant' flag) assuming zero erosion. Sorry for all the qualifiers.
New feature: CONCAT() which means to concatenate items selected without any specified separator (ie without manually putting in the ' ' after each entry, it would smoosh everything together. See note below why I didn't specify a separator).

SELECT CONCAT(
base_sample.name, ' ',
base_sample.lat_DD, ' ',
base_sample.lon_DD, ' ',
base_sample.elv_m, ' ',
'std ',
base_sample.thick_cm, ' ',
base_sample.density, ' ',
base_sample.shielding, ' ',
'0.000 ',
'2010;',
CHAR(13),
base_sample.name, ' ',
'Be-10 quartz ',
_be10_al26_quartz.N10_atoms_g, ' ',
_be10_al26_quartz.delN10_atoms_g, ' ',
_be10_al26_quartz.Be10_std, ';')
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
JOIN _be10_al26_quartz ON _be10_al26_quartz.sample_id = base_sample.id
Where base_site.id = 5013

note: There might be an easier way to concatenate, namely using the CONCAT_WS command instead of CONCAT and specifying “ ” as the separator, but I couldn't figure out how to ignore the separator AFTER returning to a new line for the second part of the entry using the CHAR(13) statement. Oh well, it still works nicely, you just have to manually separate each entry in the list with its own ' '.



3. Select/list each 14C extraction lab currently represented in the database.
New Feature: only select distinct (ie no repeats) items in a list using DISTINCT statement

SELECT DISTINCT _c14_quartz.extraction_lab
FROM _c14_quartz

note: in this case, since I wasn't looking for sample info at all but rather JUST info from the 14C nuclide table, I use the FROM _c14_quartz statement NOT FROM base_sample



4. Select sample/site/etc. info from samples that fall within a specified geographic extent. For example, if I draw a quick box around the Sierra Nevada Range and I want all the samples within that irregular (ie non rectuangular) geographic area, you can use this example.
New Feature: making use of the ST_WITHIN statement and the ST_POLYGONFROMTEXT statement by writing lon-lat pairs of polygon vertices in WKT (well-known text representation format for geographic coordinates). See note below on POLYGON vs MULTIPOLYGON

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD, base_sample.elv_m, base_site.short_name, base_publication.DOI
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
JOIN base_samplepublicationsmatch ON base_samplepublicationsmatch.sample_id = base_sample.id
JOIN base_publication ON base_publication.id = base_samplepublicationsmatch.publication_id
WHERE ST_WITHIN( 
	POINT(base_sample.lon_DD, base_sample.lat_DD), 
	ST_POLYGONFROMTEXT('POLYGON ((-121.6177832712912 40.65734067083913, -121.9635604215141 40.34959289869229, -121.9326194704155 39.51020821586348, -121.1700934647455 38.40736429119892, -120.6645068022505 37.85398565972527, -119.2607376433683 36.44037323568318, -118.8225805167507 35.71744451090707, -118.8217127020246 35.02779108541822, -118.3855233698941 34.84696655421908, -117.6907312869657 35.39202254584936, -117.6458438423149 36.50688611452704, -118.2344226431773 37.44830616792614, -119.6062811908193 38.92664461511623, -120.4904362798662 39.83503929284614, -120.7698247281214 40.15764916245983, -121.107462474342 40.48937918577258, -121.6177832712912 40.65734067083913))') 
)

note: if you have one file/coverage with multiple polygons (let's say you outline the Sierra Nevada Range AND the Colorado Rockies in one layer instead of two separate layers), then you can slightly alter the query above by replacing ST_POLYGONFROMTEXT with ST_MULTIPOLYGONFROMTEXT and 'POLYGON' with 'MULTIPOLGYON'. You will also need to add another layer of parentheses within the coordinate pairs nested parentheses sets to separate your polygons.

WHERE ST_WITHIN( 
	POINT(base_sample.lon_DD, base_sample.lat_DD), 
	ST_POLYGONFROMTEXT('POLYGON ((-121.75158148689775 41.495610104520644, -116.98560917050116 35.88667669814915, -121.26525778114299 36.50268672543851, -121.75158148689775 41.495610104520644))') 
)

versus

WHERE ST_WITHIN( 
	POINT(base_sample.lon_DD, base_sample.lat_DD), 
	ST_MULTIPOLYGONFROMTEXT('MULTIPOLYGON (((-121.75158148689775 41.495610104520644, -116.98560917050116 35.88667669814915, -121.26525778114299 36.50268672543851, -121.75158148689775 41.495610104520644)), ((-106.2216444831293 42.01435539065905, -102.07168219402207 34.33044083973394, -108.84779249420497 36.664794627356756, -106.2216444831293 42.01435539065905)))') 
)


5. Select samples only within a specific application.
New Feature: not really a new concept, but just demonstrating how sites are linked to applications and how you can isolate sites/samples from specific applications. In the example below, I only want samples from the Greenland application (ie application 3).

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
JOIN base_application_sites ON base_application_sites.site_id = base_site.id
JOIN base_application ON base_application.id = base_application_sites.application_id
WHERE base_application.id = 3

note: notice a potential short cut? We are sorting by application ID, which is also specified in the base_application_sites table so you could just omit the base_application JOIN and make your WHERE statement say: WHERE base_application_sites.application_id = 3 and get the same result. Cool!



6. Selecting samples where the site type (ie base_site.what) is moraine where the short name of the site is known. This would also work for or any other attribute column that is a 'string'.
New Feature: using the LIKE statement to find entries that match the specified string of characters.

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
WHERE base_site.what LIKE "%moraine%"

note: as far as I can tell, the syntax “%your string here%” is NOT caps sensitive so the following query would return samples that belong to sites that are Moraines and moraines. It doesn't have to be an exact match either, so sites that are “Moraine” or “moraine” or “blue ice moraine” or “Lateral Moraine” will all be selected.

If you want an EXACT match, omit the %'s inside the parentheses. Although this statement is still not case sensitive, just FYI.



7. Filtering selection by choosing samples from sites that have AT LEAST 3 samples per site. For example, let's say I only trust moraine ages in the database that are based on at least 3 boulder ages on the moraine, any less and the age might be uncertain.
New Feature: lots of new features in this one, but mainly the GROUP_BY and HAVING COUNT (*) built-in functions.

SELECT base_sample.name, base_sample.lat_DD, base_sample.lon_DD
FROM base_sample
JOIN base_site ON base_site.id = base_sample.site_id
WHERE base_site.what LIKE "%moraine%"
AND base_sample.site_id IN
(SELECT base_sample.site_id FROM base_sample GROUP BY base_sample.site_id HAVING COUNT(*) > 2)

note: the statement before and within the parentheses (I think) works more or less like this:
1. “AND base_sample.site_id IN” = give me all samples where their site_id field meets the following criteria.
2.“SELECT base_sample.site_id FROM base_sample GROUP BY base_sample.site_id” = within the parentheses argument, select the site_id field from the samples table and group all entries in the samples table by their site_id.
3. “HAVING COUNT(*) > 2” = once samples are grouped by their site_id, find and select only samples where the number of samples per respective site is greater than 2 (ie at least 3).