From c78468ca1eec1566861144385b862d07f59f8aac Mon Sep 17 00:00:00 2001
From: Anshul Singhvi <anshulsinghvi@gmail.com>
Date: Mon, 23 Sep 2024 18:49:52 -0700
Subject: [PATCH] more wip + started categorical raster

---
 Project.toml                         |   4 +
 chapters/02-attribute-operations.qmd | 185 ++++++++++++++++++++++++++-
 2 files changed, 186 insertions(+), 3 deletions(-)

diff --git a/Project.toml b/Project.toml
index 0820fb5..7af6b0b 100644
--- a/Project.toml
+++ b/Project.toml
@@ -1,7 +1,9 @@
 [deps]
 ArchGDAL = "c9ce4bd3-c3d5-55b8-8973-c0e20141b8c3"
 CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
+CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
 DimensionalData = "0703355e-b756-11e9-17c0-8b28908087d0"
 GeoDataFrames = "62cb38b5-d8d2-4862-a48e-6a340996859f"
 GeoFormatTypes = "68eda718-8dee-11e9-39e7-89f7f65f511f"
@@ -14,9 +16,11 @@ LightOSM = "d1922b25-af4e-4ba3-84af-fe9bea896051"
 OSMToolset = "a1c25ae6-0f93-4b3a-bddf-c248cb99b9fa"
 OpenStreetMapX = "86cd37e6-c0ff-550b-95fe-21d72c8d4fc9"
 Proj = "c94c279d-25a6-4763-9509-64d165bea63e"
+Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
 Rasters = "a3a2b9e3-a471-40c9-b274-f788e487c689"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
+TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
 
 [compat]
 GeometryOps = "0.1.12"
diff --git a/chapters/02-attribute-operations.qmd b/chapters/02-attribute-operations.qmd
index df68889..28722ba 100644
--- a/chapters/02-attribute-operations.qmd
+++ b/chapters/02-attribute-operations.qmd
@@ -135,7 +135,7 @@ Notice that the first column, `:geom`, is composed of `IGeometry{wkbMultiPolygon
 
 We can also get some geospatial information - `GI.geometrycolumns(world)` returns `{julia} GI.geometrycolumns(world)`, and `GI.crs(world)` returns `{julia} GI.crs(world)`.
 
-## Dropping geometries
+### Dropping geometries
 
 We can drop the geometry column by subsetting the `DataFrame`:
 
@@ -150,7 +150,24 @@ Becoming skilled at geographic attribute data manipulation means becoming skille
 
 ### Vector attribute subsetting
 
-There are multiple ways to subset data in Julia.  First, and probably most simply, we can index into the DataFrame object using a few kinds of selectors.  Rows are always selected first, and then columns go in the second position.  We can select the first 5 rows of the `:pop_est` column like so:
+There are multiple ways to subset data in Julia.  
+First, and probably most simply, we can index into the DataFrame object using a few kinds of selectors.  This can select rows and columns.  
+
+Indices placed inside square brackets directly after a data frame object name specify the elements to keep.
+
+Rows are always selected first, and then columns go in the second position.  We can select the first 5 rows of the `:pop_est` column like so:
+
+::: {.callout-note collapse="true"}
+
+## Recap: indexing in Julia
+
+Indexing in Julia is 1-based, like R, and unlike Python, which is 0-based.
+
+It's performed using the `[inds...]` operator.  The `:` operator is used to select all elements in that dimension, and you can select a range using `start:stop`. 
+You can also pass vectors of indices or boolean values to select specific elements.
+
+In DataFrames.jl, using `!` as the row selector, as in `world[!, :pop]`, returns the column itself without copying it, whereas `world[:, :pop]` returns a copy.  This syntax is also used when replacing an entire column or creating a new column.
+:::
 
 ```{julia}
 world[1:5, :pop]
@@ -164,4 +181,166 @@ world[5:end, [:pop, :continent]]
 
 and note that this returns a new DataFrame with only the selected columns.
 
-We can also use the `select` function to subset by some predicate.  Let's select all countries whose populations are greater than 30 million, but less than 1 billion:
+We can also drop all rows that have a missing value in a given column using the `dropmissing` function:
+
+```{julia}
+world_with_pop = dropmissing(world, :pop)
+```
+
+There is also a mutating version of `dropmissing`, called `dropmissing!`, which modifies the input in place.
+
+We can also subset by a boolean vector, computed on some predicate.  Let's select all countries whose populations are greater than 30 million, but less than 1 billion.
+```{julia}
+countries_to_select = 30_000_000 .< world_with_pop.pop .< 1_000_000_000
+```
+
+```{julia}
+world_with_pop[countries_to_select, :]
+```
+
+A more concise way to achieve the same result is `world_with_pop[30_000_000 .< world_with_pop.pop .< 1_000_000_000, :]`.
+
+
+Here's a small exercise: guess the number of rows and columns in the `DataFrame` objects returned by each of the following commands, then check your answer by executing the commands in Julia.
+
+```{julia}
+#| eval: false
+world[1:6, :]   # subset rows by position
+world[:, 1:3]    # subset columns by position
+world[1:6, 1:3] # subset rows and columns by position
+world[:, [:name_long, :pop]] # columns by name
+world[:, [true, true, false, false, false, false, false, true, true, false, false]] # by logical indices
+world[:, 888] # an index representing a non-existent column
+```
+
+
+
+The population subset computed above can also be obtained with each of the DataFrame manipulation packages mentioned earlier.
+
+
+::: {.panel-tabset}
+
+## DataFrames.jl
+
+DataFrames.jl also defines a `subset` function, which is another way to achieve this result:
+
+```{julia}
+subset(world_with_pop, :pop => ByRow(x -> !ismissing(x) && 30_000_000 < x < 1_000_000_000))
+```
+
+## DataFramesMeta.jl
+
+DataFramesMeta.jl provides a convenient syntax for subsetting DataFrames using a DSL that closely resembles the tidyverse.
+
+```{julia}
+using DataFramesMeta
+
+@chain world_with_pop begin
+    @subset @byrow (!ismissing(:pop) && 30_000_000 < :pop < 1_000_000_000)
+    select(:name_long, :pop)
+end
+``` 
+
+## TidierData.jl
+
+TidierData.jl implements the tidyverse's dplyr and tidyr verbs as Julia macros on top of DataFrames.jl, so columns are referred to by bare names.
+
+```{julia}
+#| eval: false
+using TidierData
+
+@chain world_with_pop begin
+    @filter(pop > 30_000_000, pop < 1_000_000_000)
+    @select(name_long, pop)
+end
+```
+
+## Query.jl
+
+Query.jl provides a LINQ-style query syntax for filtering and projecting tabular data, including DataFrames.
+
+```{julia}
+using Query
+
+@from row in world_with_pop begin
+    @where !ismissing(row.pop) && 30_000_000 < row.pop < 1_000_000_000
+    @select {name_long = row.name_long, pop = row.pop}
+    @collect DataFrame
+end
+```
+
+:::
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+## Manipulating raster objects
+
+In contrast to the vector data model underlying simple features (which represents points, lines and polygons as discrete entities in space), raster data represent continuous surfaces.
+This section shows how raster objects work by creating them *from scratch*, building on Section \@ref(an-introduction-to-terra).
+Because of their unique structure, subsetting and other operations on raster datasets work in a different way, as demonstrated in Section \@ref(raster-subsetting).
+
+
+The following code recreates the raster dataset used in Section \@ref(raster-classes), the result of which is illustrated in Figure \@ref(fig:cont-raster).
+This demonstrates how the `Raster()` constructor works to create an example raster named `elev` (representing elevations).
+
+```{julia}
+vals = reshape(1:36, 6, 6)
+elev = Raster(vals, (X(LinRange(-1.5, 1.5, 6)), Y(LinRange(-1.5, 1.5, 6))))
+```
+
+
+The result is a raster object with 6 rows and 6 columns, and spatial lookup vectors for the dimensions `X` (horizontal) and `Y` (vertical).
+The `vals` argument sets the values that each cell contains: numeric data ranging from 1 to 36 in this case.
+
+
+Raster objects can also contain categorical values, such as strings or integer codes that map to category labels.
+The following code creates the raster datasets shown in Figure \@ref(fig:cont-raster):
+
+```{julia}
+# First, construct a categorical array
+using CategoricalArrays
+
+grain_order = ["clay", "silt", "sand"]
+grain_char = rand(grain_order, 6, 6)
+grain_fact = CategoricalArray(grain_char, levels = grain_order)
+
+# Then, wrap the categorical array in a Raster object
+grain = Raster(grain_fact, (X(LinRange(-1.5, 1.5, 6)), Y(LinRange(-1.5, 1.5, 6))))
+```
+
+```{julia}
+elev = Raster("raster/elev.tif")
+grain = Raster("raster/grain.tif")
+```
+
+This `CategoricalArray` is stored in two parts: a matrix of integer codes and a dictionary of levels that maps the integer codes to the string values.
+We can retrieve the levels of a `CategoricalArray` using the `levels()` function, and modify them using `levels!()`.
+