--- title: "Genetic assignment to Mixed-Stock Analysis with RHISEA" author: - name: Sosthene Akia email: "sosthene.akia@dfo-mpo.ca" affiliation: Fisheries and Oceans Canada - name: Alex Hanke email: "alex.hanke@dfo-mpo.ca" affiliation: Fisheries and Oceans Canada date: "2026-03-16" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comprehensive Guide to the RHISEA Package for Mixed-Stock Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview This tutorial illustrates how to integrate individual-level genetic assignment results into RHISEA to estimate mixed-stock proportions. It uses genotype data in GENEPOP format, processed with adegenet and assignPOP, and then feeds assignment probabilities and confusion matrices to `run_hisea_estimates()` for population-scale mixing estimates. The workflow includes: - Converting GENEPOP data to an adegenet/assignPOP-compatible format - Performing Monte Carlo and K-fold cross-validation for assignment accuracy - Generating individual assignment probabilities and confusion matrices - Passing these outputs to RHISEA for mixed-stock estimation. ## Data preparation and import In this section, genotype data in GENEPOP format are read and filtered, then combined with morphometric data to create an integrated baseline dataset. ``` r # Read baseline GENEPOP data and reduce low-variance loci YourGenepop <- read.Genepop( "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/simGenepop.txt", pop.names = c("pop_A", "pop_B", "pop_C"), haploid = FALSE ) #> #> Converting data format... #> #> Encoding genetic data... #> #> ################ assignPOP v1.3.1 ################ #> #> A GENEPOP format file was successfully imported! #> #> Imported Data Info: 90 obs. by 1000 loci (diploid) #> Number of pop: 3 #> Number of inds (pop_A): 30 #> Number of inds (pop_B): 30 #> Number of inds (pop_C): 30 #> DataMatrix: 90 rows by 2001 columns, with 2000 allele variables #> #> Data output in a list comprising the following three elements: #> YOUR_LIST_NAME$DataMatrix #> YOUR_LIST_NAME$SampleID #> YOUR_LIST_NAME$LocusName YourGenepopRd <- reduce.allele(YourGenepop, p = 0.95) #> #> New data matrix has created! :) #> New DataMatrix size: 90 rows by 1387 columns #> 614 columns (alleles) have been removed #> 693 loci remaining # Integrate genetic and morphometric data YourIntegrateData <- compile.data( YourGenepopRd, "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphData.csv" ) #> Import a .CSV file. #> 4 additional variables detected. #> Checking variable data type... #> D1.2(integer) D2.3(integer) D3.4(integer) D1.4(integer) please enter variable names for changing data type (separate names by a whitespace if multiple) #> Error in compile.data(YourGenepopRd, "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphData.csv"): Please enter correct feature names. # Import morphometric data morphdf <- read.csv( "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphData.csv", header = TRUE ) ``` Population labels are then constructed and appended to the morphometric dataset, illustrating both character and numeric encodings. ``` r # Population labels as characters pop_label_char <- c(rep("pop_A", 30), rep("pop_B", 30), rep("pop_C", 30)) morphdf_pop_char <- cbind(morphdf, pop_label = pop_label_char) # Population labels as numeric factors pop_label_num <- c(rep(1, 30), rep(2, 30), rep(3, 30)) morphdf_pop <- cbind(morphdf, pop_label = pop_label_num) morphdf_pop$pop_label <- as.factor(morphdf_pop$pop_label) ``` ## Assignment cross-validation Monte Carlo and K-fold cross-validation are used to evaluate assignment performance for different training proportions and subsets of loci. ``` r # Monte Carlo cross-validation using SVM assign.MC( YourGenepopRd, train.inds = c(0.5, 0.7, 0.9), train.loci = c(0.1, 0.25, 0.5, 1), loci.sample = "fst", iterations = 30, model = "svm", dir = "Result-folder/" ) #> #> Parallel computing is on. Analyzing data using 11 cores/threads of CPU... #> #> Monte-Carlo cross-validation done!! #> 360 assignment tests completed!! # K-fold cross-validation using LDA assign.kfold( YourGenepopRd, k.fold = c(3, 4, 5), train.loci = c(0.1, 0.25, 0.5, 1), loci.sample = "random", model = "lda", dir = "Result-folder2/" ) #> #> Parallel computing is on. Analyzing data using 11 cores/threads of CPU... #> #> K-fold cross-validation done!! #> 48 assignment tests completed!! ``` Cross-validation outputs are summarized to quantify classification accuracy across scenarios. ``` r accuMC <- accuracy.MC(dir = "Result-folder/") # Monte Carlo accuracy #> #> Correct assignment rates were estimated!! #> A total of 360 assignment tests for 3 pops. #> Results were also saved in a 'Rate_of_360_tests_3_pops.txt' file in the directory. accuKF <- accuracy.kfold(dir = "Result-folder2/") # K-fold accuracy #> #> Correct assignment rates were estimated!! #> A total of 48 assignment tests for 3 pops. #> Results were also saved in a 'Rate_of_48_tests_3_pops.txt' file in the directory. # Quick inspection of accuracy objects str(accuMC) #> 'data.frame': 360 obs. of 7 variables: #> $ train.inds : Factor w/ 3 levels "0.5","0.7","0.9": 1 1 1 1 1 1 1 1 1 1 ... #> $ train.loci : Factor w/ 4 levels "0.1","0.25","0.5",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ iters : Factor w/ 30 levels "1","10","11",..: 1 2 3 4 5 6 7 8 9 10 ... #> $ assign.rate.all : num 0.711 0.667 0.6 0.622 0.6 ... #> $ assign.rate.pop_A: num 0.533 0.267 0.4 0.467 0.467 ... #> $ assign.rate.pop_B: num 0.667 0.8 0.467 0.533 0.333 ... #> $ assign.rate.pop_C: num 0.933 0.933 0.933 0.867 1 ... str(accuKF) #> 'data.frame': 48 obs. of 7 variables: #> $ KF : Factor w/ 3 levels "3","4","5": 1 1 1 2 2 2 2 3 3 3 ... #> $ fold : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 1 2 3 4 1 2 3 ... #> $ train.loci : Factor w/ 4 levels "0.1","0.25","0.5",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ assign.rate.all : num 0.667 0.733 0.333 0.522 0.652 ... #> $ assign.rate.pop_A: num 0.7 0.6 0.4 0.25 0.625 ... #> $ assign.rate.pop_B: num 0.3 0.6 0.2 0.5 0.429 ... #> $ assign.rate.pop_C: num 1 1 0.4 0.857 0.875 ... ``` ## Assignment of unknown individuals Unknown individuals are imported and assigned using both genetic-only and integrated data with different classifiers. ``` r # Import GENEPOP file for unknown individuals YourGenepopUnknown <- read.Genepop( "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/simGenepopX.txt" ) #> #> Converting data format... #> #> Encoding genetic data... #> #> ################ assignPOP v1.3.1 ################ #> #> A GENEPOP format file was successfully imported! #> #> Imported Data Info: 30 obs. by 1000 loci (diploid) #> Number of pop: 1 #> Number of inds (pop.1): 30 #> DataMatrix: 30 rows by 1835 columns, with 1834 allele variables #> #> Data output in a list comprising the following three elements: #> YOUR_LIST_NAME$DataMatrix #> YOUR_LIST_NAME$SampleID #> YOUR_LIST_NAME$LocusName # Integrate genetic and morphometric data for unknowns YourIntegrateUnknown <- compile.data( YourGenepopUnknown, "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphDataX.csv" ) #> Import a .CSV file. #> 4 additional variables detected. #> Checking variable data type... #> D1.2(integer) D2.3(integer) D3.4(integer) D1.4(integer) please enter variable names for changing data type (separate names by a whitespace if multiple) #> Error in compile.data(YourGenepopUnknown, "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphDataX.csv"): Please enter correct feature names. # Non-genetic data only NonGeneticUnknown <- read.csv( "https://raw.githubusercontent.com/alexkychen/assignPOP/master/inst/extdata/morphDataX.csv", header = TRUE ) ``` Two assignment models are illustrated: naive Bayes using genetic data and a decision tree using integrated data. ``` r # 1. Assignment using genetic data and naive Bayes assign.X( x1 = YourGenepopRd, x2 = YourGenepopUnknown, dir = "Result-folder3/", model = "naiveBayes" ) #> #> Known and unknown datasets have unequal features. #> Automatically identify common features between datasets... #> 1375 features are used for assignment. #> Performing PCA on genetic data for dimensionality reduction... #> Assignment test is done! See results in your designated folder. #> Predicted populations and probabilities are saved in [AssignmentResult.txt] ```
plot of chunk assignment_tests_naives_Bayes
plot of chunk assignment_tests_decision_tree