---
title: "Introduction of itsdm with a virtual species"
author: "Lei Song"
date: "2022-11-16"
output:
  rmarkdown::html_document:
    theme: readable
vignette: >
  %\VignetteIndexEntry{Introduction of itsdm with a virtual species}
  %\VignetteEngine{knitr::rmarkdown_notangle}
  %\VignetteEncoding{UTF-8}
---


## Set up

Install your missing packages


```r
install.packages('rnaturalearth')
install.packages('here')
install.packages('virtualspecies')
```


```r
library(sf, quietly = T)
library(itsdm, quietly = T)
library(ggplot2, quietly = T)
library(dplyr, quietly = T)
select <- dplyr::select
```

## Prepare environmental variables

We could use packages like `rnaturalearth` to quickly get the boundary of most countries
and regions. You can also read your study area boundary for sure. Providing
your boundary to function `worldclim2` would allow you to download files from worldclim
version 2 clipping to your area.


```r
library(stars, quietly = T)
library(rnaturalearth, quietly = T)

# Get Africa continent
af <- ne_countries(
  continent = 'africa', returnclass = 'sf') %>%
  filter(admin != 'Madagascar') # remove Madagascar

# Union countries to continent
af_bry <- st_buffer(af, 0.1) %>%
  st_union() %>%
  st_as_sf() %>%
  rename(geometry = x) %>%
  st_make_valid()

bios <- worldclim2(var = 'bio', bry = af_bry,
                   path = tempdir(),
                   nm_mark = 'africa')

# Plot BIO1 to check the variables
# plot(bios %>% slice('band', 1),
#      main = st_get_dimension_values(bios, 'band')[1])
```

In species modeling, people usually want to remove the strong correlations between
environmental variables. `dim_reduce` is such a function you need. The function
could either reduce the dimension of your environmental variable stack itself or according
to a bunch of observations. It also allows you to set a desirable threshold. Note that it only
works on numeric variables. Because categorical variables have less risk of having a high correlation with others, we usually prefer to keep categorical variables.


```r
library(stars, quietly = T)
# An example of reducing dimensions
## Here we didn't set samples, so use whole image
bios_reduce <- dim_reduce(
  bios, threshold = 0.6,
  preferred_vars = c('bio1', 'bio12', 'bio5'))

# Returned ReducedImageStack object
bios_reduce
#> Dimension reduction
#> Correlation threshold: 0.6
#> Original variables: bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9,
#> bio10, bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19
#> Variables after dimension reduction: bio1, bio12, bio9, bio14, bio15
#> ================================================================================
#> Reduced correlations:
#>        bio1 bio12  bio9 bio14 bio15
#> bio1   1.00 -0.04  0.50 -0.07  0.44
#> bio12 -0.04  1.00 -0.03  0.56 -0.16
#> bio9   0.50 -0.03  1.00  0.01 -0.06
#> bio14 -0.07  0.56  0.01  1.00 -0.40
#> bio15  0.44 -0.16 -0.06 -0.40  1.00

# img_reduced of ReducedImageStack is the raster stack
bios_reduce$img_reduced
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#>                Min.  1st Qu.   Median     Mean  3rd Qu. Max.   NA's
#> reduced_image     0 17.75392 25.91388 154.5419 83.09974 4347 447385
#> dimension(s):
#>      from   to offset     delta refsys point         values x/y
#> x     975 1388   -180  0.166667 WGS 84 FALSE           NULL [x]
#> y     316  750     90 -0.166667 WGS 84 FALSE           NULL [y]
#> band    1    5     NA        NA     NA    NA bio1,...,bio15
```

## Creating the virtual species

Using virtual species is a crucial method in ecological studies. First, let's create a virtual species using the package `virtualspecies` to know exactly what is happening.


```r
library(here, quietly = T)
library(virtualspecies, quietly = T)

# Subset environmental variables to use
bios_sub <- bios %>% slice('band', c(1, 5, 12, 15))
bios_sub <- stack(as(split(bios_sub), 'Spatial'))

# Formatting of the response functions
set.seed(10)
my.parameters <- formatFunctions(
  bio1 = c(fun = 'dnorm', mean = 25, sd = 5),
  bio5 = c(fun = 'dnorm', mean = 35, sd = 5),
  bio12 = c(fun = 'dnorm', mean = 1000, sd = 500),
  bio15 = c(fun = 'dnorm', mean = 100, sd = 50))

# Generation of the virtual species
set.seed(10)
my.species <- generateSpFromFun(
  raster.stack = bios_sub,
  parameters = my.parameters,
  plot = F)

# Conversion to presence-absence
set.seed(10)
my.species <- convertToPA(
  my.species,
  beta = 0.7,
  plot = F)

# Check maps of this virtual species if you like
# plot(my.species)

# Check response curves
plotResponse(my.species)
```

<img src="intro-virtualspecies-1.png" alt="plot of chunk virtualspecies" style="display: block; margin: auto;" />

## Generate pseudo samples for virtual species


```r
# Sampling
set.seed(10)
po.points <- sampleOccurrences(
  my.species,
  n = 2000,
  type = "presence only",
  plot = FALSE)
po_df <- po.points$sample.points %>%
  select(x, y) %>%
  mutate(id = row_number())
head(po_df)
#>           x          y id
#> 1 -6.083333  11.083333  1
#> 2 -5.750000  10.250000  2
#> 3 38.750000  -6.750000  3
#> 4 39.250000 -10.416667  4
#> 5 26.583333  -8.583333  5
#> 6  0.250000   9.083333  6
```

As we all know, there are commonly sampling bias and observation errors. People use multiple methods to reduce these disturbances in samples. For example, here, we use the function `suspicious_env_outliers` to detect and/or remove possible environmental outliers. This step could be used with other strategies to do sample cleaning.


```r
# Get environmental variable stack
variables <- bios %>% slice('band', c(1, 5,  12, 15))

# Check outliers
occ_outliers <- suspicious_env_outliers(
  po_df,
  variables = variables,
  z_outlier = 5,
  outliers_print = 4,
  visualize = FALSE)
#> Reporting top 4 outliers [out of 6 found]
#> 
#> row [463] - suspicious column: [bio5] - suspicious value: [32.78]
#> 	distribution: 95.714% >= 36.01 - [mean: 37.68] - [sd: 0.66] - [norm. obs: 67]
#> 	given:
#> 		[bio1] > [27.31] (value: 27.39)
#> 		[bio15] <= [102.79] (value: 102.48)
#> 		[bio12] <= [993.00] (value: 845.00)
#> 
#> 
#> row [1346] - suspicious column: [bio5] - suspicious value: [32.96]
#> 	distribution: 95.714% >= 36.01 - [mean: 37.68] - [sd: 0.66] - [norm. obs: 67]
#> 	given:
#> 		[bio1] > [27.31] (value: 27.41)
#> 		[bio15] <= [102.79] (value: 102.34)
#> 		[bio12] <= [993.00] (value: 855.00)
#> 
#> 
#> row [1728] - suspicious column: [bio5] - suspicious value: [33.42]
#> 	distribution: 95.714% >= 36.01 - [mean: 37.68] - [sd: 0.66] - [norm. obs: 67]
#> 	given:
#> 		[bio1] > [27.31] (value: 27.44)
#> 		[bio15] <= [102.79] (value: 66.08)
#> 		[bio12] <= [993.00] (value: 958.00)
#> 
#> 
#> row [821] - suspicious column: [bio5] - suspicious value: [31.90]
#> 	distribution: 98.333% >= 34.51 - [mean: 36.25] - [sd: 0.84] - [norm. obs: 59]
#> 	given:
#> 		[bio1] between (24.12, 25.70] (value: 24.70)
#> 		[bio15] > [113.08] (value: 128.18)
#> 		[bio12] <= [996.00] (value: 658.00)

# Check result
# You could also plot samples overlap with a raster
# plot(occ_outliers,
#      overlay_raster = variables %>% slice('band', 6))
plot(occ_outliers)
```

<img src="intro-remove_outliers-1.png" alt="plot of chunk remove_outliers" style="display: block; margin: auto;" />

```r

# Remove outliers if necessary
occ_outliers <- suspicious_env_outliers(
  po_df, variables = variables,
  rm_outliers = T,
  z_outlier = 5,
  outliers_print = 0L,
  visualize = FALSE)
po_sf <- occ_outliers$pts_occ

# Make occurrences
set.seed(11)
occ_sf <- po_sf %>% sample_frac(0.7)
occ_test_sf <- po_sf %>% filter(! id %in% occ_sf$id)
occ_sf <- occ_sf %>% select(-id) %>% 
  mutate(observation = 1)
occ_test_sf <- occ_test_sf %>% select(-id) %>% 
  mutate(observation = 1)

# Have a look at the samples if you like
# ggplot() +
#   geom_raster(data = as.data.frame(my.species$suitab.raster, xy = T),
#               aes(x, y, fill = layer)) +
#   scale_fill_viridis_c('Suitability', na.value = 'transparent') +
#   geom_sf(data = occ_sf, aes(color = 'Train'), size = 0.8) +
#   geom_sf(data = occ_test_sf, aes(color = 'Test'), size = 0.8) +
#   scale_color_manual('', values = c('Train' = 'red', 'Test' = 'blue')) +
#   theme_classic()

# Recheck the variable correlation
dim_reduce(variables, threshold = 1.0, samples = occ_sf)
#> Dimension reduction
#> Correlation threshold: 1
#> Original variables: bio1, bio5, bio12, bio15
#> Variables after dimension reduction: bio1, bio5, bio12, bio15
#> ================================================================================
#> Reduced correlations:
#>        bio1  bio5 bio12 bio15
#> bio1   1.00  0.69  0.19 -0.22
#> bio5   0.69  1.00 -0.03  0.20
#> bio12  0.19 -0.03  1.00 -0.37
#> bio15 -0.22  0.20 -0.37  1.00
```

Unfortunately, bio1 and bio5 have strong correlation with each other. This might affect the model explanation later.

## Build a simple `isolation_forest` species distribution model

Here we build a SDM using extended isolation forest (with `ndim = 2`) and a sample rate of 0.8.


```r
# Do modeling
it_sdm <- isotree_po(obs = occ_sf,
                     obs_ind_eval = occ_test_sf,
                     variables = variables,
                     sample_size = 0.8,
                     ndim = 2)
```

Let's compare the predicted suitability with real suitability.

<img src="intro-prediction-1.png" alt="plot of chunk prediction" style="display: block; margin: auto;" />

<img src="intro-raw_suit-1.png" alt="plot of chunk raw_suit" style="display: block; margin: auto;" />

Let's do model evaluation using multiple presence-only metrics. In this package, we implement both presence-only and presence-background evaluation metrics. The model calculated evaluation on both training and test datasets. Here we just display evaluation on test dataset. You could check `it_sdm$eval_train` the same way as `it_sdm$eval_test`.


```r
# Metrics based on test dataset
it_sdm$eval_test
#> ===================================
#> Presence-only evaluation:
#> CVI with 0.25 threshold:      0.640
#> CVI with 0.5 threshold:       0.809
#> CVI with 0.75 threshold:      0.704
#> CBI:                          0.986
#> AUC (ratio)                   0.942
#> ===================================
#> Presence-background evaluation:
#> Sensitivity:                  0.972
#> Specificity:                  0.856
#> TSS:                          0.828
#> AUC:                          0.946
#> Similarity indices:
#> Jaccard's similarity index:   0.849
#> Sørensen's similarity index:  0.919
#> Overprediction rate:          0.129
#> Underprediction rate:         0.028
plot(it_sdm$eval_test)
```

<img src="intro-evaluation-1.png" alt="plot of chunk evaluation" style="display: block; margin: auto;" />

The result of `isotree_po` has options to generate response curves and variable analysis together. The response curves include marginal response curves, independent response curves, and Shapley value-based dependence. The variable analysis consists of the Jackknife of Pearson correlation with the result of the full model with all variables and AUC_ratio and variable dependence with SHAP test.


```r
# Plot response curves
## Marginal response curves of bio5 and bio6
plot(it_sdm$marginal_responses, target_var = c('bio1', 'bio5'))
```

<img src="intro-marginal_responses-1.png" alt="plot of chunk marginal_responses" style="display: block; margin: auto;" />


```r
## Independent response curves of variable bio1 and bio12.
plot(it_sdm$independent_responses, target_var = c('bio1', 'bio12'))
```

<img src="intro-independent_responses-1.png" alt="plot of chunk independent_responses" style="display: block; margin: auto;" />


```r
## Variable dependence scatter points with fitted curves made by SHAP test
plot(it_sdm$shap_dependence, smooth_line = FALSE)
```

<img src="intro-variable_dependence-1.png" alt="plot of chunk variable_dependence" style="display: block; margin: auto;" />


```r
# Printing variable analysis could give you enough info of variable importance
it_sdm$variable_analysis
#> Relative variable importance
#> ===================================
#> Methods: Jackknife test and SHAP
#> Numer of variables: 4
#> ===================================
#> Jackknife test
#> Based on Pearson correlation (Max value is 1)
#> [Training dataset]:
#> bio12 With only: //////////////////////////////////////// 0.885
#>       Without  : ////////////////////////////////////////// 0.937
#> bio15 With only: //////////////////////////////// 0.71
#>       Without  : //////////////////////////////////////////// 0.974
#> bio5  With only: /////////////////////////////// 0.692
#>       Without  : //////////////////////////////////////////// 0.983
#> bio1  With only: //////////////////////// 0.543
#>       Without  : //////////////////////////////////////////// 0.983
#> [Test dataset]:
#> bio12 With only: /////////////////////////////////////// 0.878
#>       Without  : ////////////////////////////////////////// 0.934
#> bio5  With only: //////////////////////////////// 0.705
#>       Without  : //////////////////////////////////////////// 0.983
#> bio15 With only: //////////////////////////////// 0.703
#>       Without  : //////////////////////////////////////////// 0.976
#> bio1  With only: //////////////////////// 0.536
#>       Without  : //////////////////////////////////////////// 0.983
#> ======================================================================
#> Jackknife test
#> Based on AUC ratio (Max value of traing and test are 0.946 and 0.942)
#> [Training dataset]:
#> bio12 With only: //////////////////////////////////////// 0.898
#>       Without  : ///////////////////////////////////////// 0.921
#> bio5  With only: ///////////////////////////////////// 0.82
#>       Without  : ////////////////////////////////////////// 0.941
#> bio15 With only: //////////////////////////////////// 0.801
#>       Without  : ////////////////////////////////////////// 0.94
#> bio1  With only: ////////////////////////////////// 0.761
#>       Without  : ////////////////////////////////////////// 0.943
#> [Test dataset]:
#> bio12 With only: //////////////////////////////////////// 0.878
#>       Without  : ///////////////////////////////////////// 0.919
#> bio5  With only: //////////////////////////////////// 0.791
#>       Without  : ////////////////////////////////////////// 0.936
#> bio15 With only: /////////////////////////////////// 0.773
#>       Without  : ////////////////////////////////////////// 0.936
#> bio1  With only: ///////////////////////////////// 0.735
#>       Without  : ////////////////////////////////////////// 0.938
#> ======================================================================
#> SHAP (mean(|Shapley value|))
#> [Training dataset]:
#> bio12 : ############################################ 0.054
#> bio15 : ######################### 0.031
#> bio5  : ######################### 0.03
#> bio1  : ##################### 0.026
#> [Test dataset]:
#> bio12 : ############################################# 0.055
#> bio5  : ########################## 0.032
#> bio15 : ######################### 0.031
#> bio1  : ##################### 0.026

# We also could plot variable importance out
plot(it_sdm$variable_analysis)
```

<img src="intro-variable_analysis-1.png" alt="plot of chunk variable_analysis" style="display: block; margin: auto;" />

According to the analysis, all explanatory variables contribute significantly to the model. This is predictable because the virtual species is made by these four variables. bio12 is the most important variable.

Besides the regular response curves, `itsdm` also makes spatially partial dependence maps. By default in `isotree_po`, Shapley value-based spatial dependence maps are not generated because of the computational efficiency. The user could generate these maps by calling function `spatial_response` later after getting the model done.

Note that a very large raster stack of environmental variables might cause memory failure or super slow computation when calculating Shapely value-based spatially dependence maps. So use it based on your own knowledge of your data. Shapley value-based dependence map will give you a bit more information of the value pushing the prediction higher or lower than average.


```r
# Generate spatially partial dependence maps including Shapley value-based one
## Larger shap_nsim value could make smoother map but takes longer time as a
## trade-off
spatial_responses_all <- spatial_response(
  model = it_sdm$model,
  var_occ = it_sdm$vars_train,
  variables = it_sdm$variables,
  shap_nsim = 20)

# Plot spatial response maps
plot(spatial_responses_all, target_var = c('bio1', 'bio12'))
```

<img src="intro-spatial_response-1.png" alt="plot of chunk spatial_response" style="display: block; margin: auto;" />

Marginal and independent effects only indicate the difference comparing the variable itself. And Shapley value based effect additionally show the relative contribution of one variable comparing to to other variables. For example, SHAP-based effect of bio1 shows that bio1 does not contribute much to the model over some areas even though it is an decisive variable.

The direct result of function `isotree_po` is environmental suitability. We could use function `convert_to_pa` to convert suitability to presence-absence based on different methods: threshold, logistic, and linear conversion, and/or a desirable species prevalence.


```r
# An example of converting to presence-absence map
## Use logistic conversion with alpha = -0.05, beta = 0.5
## and not set species prevalence
pa_map <- convert_to_pa(it_sdm$prediction,
                        method = "logistic",
                        beta = 0.7, # the same with virtual species
                        alpha = -.05,
                        visualize = FALSE)
pa_map; plot(pa_map)
#> Logistic conversion
#> beta = 0.7
#> alpha = -0.05
#> species prevalence = 0.105503625307627
```

<img src="intro-pa-1.png" alt="plot of chunk pa" style="display: block; margin: auto;" />

## Analyze variable dependence

It is always helpful to understand the dependence among variables. The result of function `shap_dependence` or `it_sdm$shap_dependence` can be used to analyze variable dependence with each other.


```r
var_dependence <- shap_dependence(
  it_sdm$model, it_sdm$vars_train,
  variables = it_sdm$variables)

# Multiple ways to plot variable VariableDependence object
## Plot without smooth fit curve
plot(var_dependence,
     target_var = c('bio1', 'bio12'),
     related_var = 'bio5', smooth_line = TRUE)
```

<img src="intro-var_inter_dependence-1.png" alt="plot of chunk var_inter_dependence" style="display: block; margin: auto;" />

Above figure shows bio1 and bio5 have strong correlations.

## Analyze variable contribution

Sometimes, we are interested in some observations, for instance, the outliers. `variable_contrib` is such function that allows you to analyze the contribution of each variable. It relies on Shapley values.


```r
## Analyze variable contribution for interested observations.
## For example, outliers.
var_contrib <- variable_contrib(
  it_sdm$model,
  it_sdm$vars_train,
  it_sdm$vars_test %>% slice(1:6))

# Plot contribution separately for each observation
## By default, it only plot the most 5 important variables for each observation
## You could change `num_features` to show more variables
plot(var_contrib, plot_each_obs = T, num_features = 4)
```

<img src="intro-var_contrib-1.png" alt="plot of chunk var_contrib" style="display: block; margin: auto;" />

For example, No.2 observation is decided largely by bio15 and bio5 negatively. Let's check it with spatial response map together.


```r
ggplot() +
  geom_stars(data = spatial_responses_all$spatial_shap_dependence$bio15) +
  scale_fill_distiller('SHAP-based effect', palette = "RdYlBu",
                       na.value = "transparent") +
  geom_sf(data = occ_test_sf %>% slice(2),
          color = 'blue', pch = 1) +
  theme_linedraw()
```

<img src="intro-var_contrib_plot-1.png" alt="plot of chunk var_contrib_plot" style="display: block; margin: auto;" />