--- title: 'flows' author: "Timothée Giraud, Laurent Beauguitte, Marianne Guérois" date: '`r Sys.Date()`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{flows} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 5, fig.height = 6 ) ``` # Introduction The **`flows`** package contains functions that select flows, provide statistics on selections and propose map and graph visualisations. The first part of the vignette reminds several methods of flow selection, the second part presents the main functions of the package and the last one proposes an example of analysis based on commuters data in the French Grand Est region. # Analysis of geographic flows: issues and methods In the field of spatial analysis, working on flows implies to focus on the relationships between places rather than on their characteristics. Analysis and flow representation often assume a selection to ease the interpretation. One of the first method developed was the so-called dominant flows (or nodal regions) proposed by Nystuen and Dacey in 1961[^1]. Working on telephone flows between cities in the Seattle area, they sought to highlight hierarchy between locations. According to this method, a place _i_ is dominated by a place _j_ if two conditions are met: 1. the most important flow from _i_ is emitted towards _j_; 2. the sum of the flows received by _j_ is greater than the sum of the flows received by _i_. This method creates what is called in graph theory a tree (acyclic graph) or a forest (a set of unconnected trees) with three types of nodes: dominant, dominated and intermediate. If the method creates a clear functional hierarchy, its major drawback is to undervalue flows intensities. Various methods have subsequently been proposed to better reflect this intensity, one of the most frequently used being the so-called major flows: it selects only the most important flows, absolute or relative, either locally or globally. Analysing commuters data between cities, one may choose to select: * all flows greater than 100; * the 50 first flows (global criterion); * the 10 first flows emitted by each city (local criterion). These criteria can also be expressed in relative form: * flows that represent more than 10% of the active population of each city (local criterion); * flows that take into account 80% of all commuters (global criterion). These methods often highlight hierarchies between places but the loss of information created by the selection is rarely questioned. Hence, it seems useful to propose statistical indicators to assess loss of information and characteristics of the selected flows. # The `flows` package A typical data workflow may be: 1. data preparation 2. flow selection 3. statistics and graphics on the selection 4. graph or map (for dominant flows) ## Data Preparation Flow data can be found in wide (matrix) or long format (_i - j - fij_, i.e. origin - destination - flow intensity). As all **flows** function take flow data in wide format, the `prepare_mat()` function transforms a link list into a square matrix. `prepare_mat()` has four arguments: a data.frame to transform (`mat`), the origin (`i`), the destination (`j`) and the flow intensity (`fij`). ```{r} library(flows) # Import data nav <- read.csv(system.file("csv/nav.csv", package = "flows")) head(nav, 4) # Prepare data mat <- prepare_mat(x = nav, i = "i", j = "j", fij = "fij") mat[1:4, 1:4] ``` ## Flow Selection Four selection methods based on flow origins are accessible through the `select_flows()` function: * `nfirst`: the `k` first flows from all origins; * `xfirst`: all flows greater than a threshold `k`; * `xsumfirst`: as many flows as necessary for each origin so that their sum is at least equal to `k`; * `dominant`: flows that satify a dominance test, this function may be used to select flows obeying the second criterion of Nystuen and Dacey method. Figure 1: The three methods of the `select_flows()` function **Black links are the selected ones.** Methods taking into account the total volume of flows are implemented when using `global = TRUE` parameter. They are identical to the ones described above: selection of the `k` first flows, selection of flows greater than `k` and selection of flows such as the sum is at least equal to `k`. All these functions take as input a square matrix of flows and generate binary matrices of the same size. Selected flows are coded 1, others 0. It is therefore possible to combine criteria of selection through element-wise multiplication of matrices (Figure 2). Figure 2: Flow selection and criteria combination ## Statistics and Graphics The `stat_mat()` function provides various indicators and graphical outputs on a flow matrix to allow statistically sound selections. Measures provided are density (number of present flows divided by the number of possible flows); number, size and composition of connected components; sum, quartiles and average intensity of flows. In addition, four graphics can be plotted: degree distribution curve (by default, outdegree), weighted degree distribution curve, Lorenz curve and boxplot on flow intensities. ```{r, fig.width=5, fig.height=5} # Get statistics about the matrix stat_mat(mat = mat, output = "none", verbose = TRUE) # Plot Lorenz curve only stat_mat(mat = mat, output = "lorenz", verbose = FALSE) ``` ```{r, fig.width=7, fig.height=7} # Graphics only stat_mat(mat = mat, output = "all", verbose = FALSE) # Statistics only mystats <- stat_mat(mat = mat, output = "none", verbose = FALSE) str(mystats) # Sum of flows mystats$sumflows ``` To ease comparisons, the `comp_mat()` function returns a data.frame that provides statistics on differences between two matrices (for example a matrix and selection of this matrix). Visualisation helps analysis, `plot_nodal_flow()` function produces a graph where sizes and colors of vertices depend on their position in the graph (dominant, intermediate or dominated) and links widths depend on flow intensities. The `map_nodal_flows()` function maps the selected flows according to the same principles. Both functions only apply to a dominant flows selection[^2]. # Example: Commuters flows in the French Grand Est As an illustration, we present a brief analysis of commuter flows between urban areas of the Grand Est region in France[^3]. We compare two different thresholds (500 and 1000) on the total volume of flows. ```{r, fig.height = 4, fig.width=4} # Remove the matrix diagonal diag(mat) <- 0 # Selection of flows > 500 mat_sel_1 <- select_flows(mat = mat, method = "xfirst", k = 500, global = TRUE) # Selection of flows > 1000 mat_sel_2 <- select_flows(mat = mat, method = "xfirst", k = 1000, global = TRUE) # Compare initial matrix and selected matrices compare_mat(mat1 = mat, mat2 = mat * mat_sel_1, digits = 0) compare_mat(mat1 = mat, mat2 = mat * mat_sel_2, digits = 0) ``` If we select flows greater than 500 commuters, we loose 96% of all links but only 38% of the volume of flows. With a threshold of 1000 commuters, 98% of links are lost but only 54% of the volume of flows. The following example selects flows that represent at least 20% of the sum of outgoing flows for each urban area. ```{r, fig.height = 6, fig.width=6} # Percentage of each outgoing flows mat_p <- mat / rowSums(mat) * 100 # Select flows that represent at least 20% of the sum of outgoing flows for # each urban area. mat_p_sel <- select_flows(mat = mat_p, method = "xfirst", k = 20) # Compare initial and selected matrices compare_mat(mat1 = mat, mat2 = mat * mat_p_sel, digits = 2) ``` This selection keeps only 8% of all links and 53% of the flows volume. We decide run a dominant flow analysis on this dataset. `nodal_flows()` combines the two criteria in a single function and returns a flow matrix. ```{r} res <- nodal_flows(mat) compare_mat(mat1 = mat, mat2 = res) ``` This analysis keeps 4% of all links and 4% of the flows volume. ```{r, fig.height=5.77, fig.width=6, eval = TRUE} library(sf) library(mapsf) UA <- st_read(system.file("gpkg/GE.gpkg", package = "flows"), layer = "urban_area", quiet = TRUE ) GE <- st_read(system.file("gpkg/GE.gpkg", package = "flows"), layer = "region", quiet = TRUE ) mf_map(GE, col = "#c6deba", border = NA, expandBB = c(0, 0, 0, .25)) out <- map_nodal_flows( mat = mat, x = UA, col_node = c("red", "orange", "yellow"), col_flow = "grey30", leg_pos_node = "topright", leg_pos_flow = "right", leg_flow = "Nb. of commuters", breaks = c(4, 100, 1000, 2500, 8655), lwd = c(1, 4, 8, 16), add = TRUE ) mf_label(out$nodes[out$nodes$w > 6000, ], var = "name", halo = TRUE, overlap = FALSE ) mf_title("Dominant Flows of Commuters") mf_credits("INSEE, 2011") mf_scale() head(out$nodes[order(out$nodes$w, decreasing = TRUE), 2:3, drop = TRUE]) ``` The top of the node hierarchy brings out clearly, in descending order, the domination of Nancy, Strasbourg, Metz and Mulhouse, each attracting more than 10 000 commuters. # Conclusion **`flows`** aims at enabling relevant flows selections, while leaving maximum flexibility to the user. [^1]: J. Nystuen & M. Dacey, 1961, "A Graph Theory Interpretation of Nodal Regions", *Papers and Proceedings of the Regional Science Association*, 7:29-42. [^2]: Viewing options are only dedicated to the nodal regions / dominant flows method since other R packages exist to ensure graph or map representations. [^3]: Data comes from the 2011 French National Census (*Recensement Général de la Population de l'INSEE*). The area includes five administrative regions: Champagne-Ardenne, Lorraine, Alsace, Bourgogne, and Franche-Comté. Cities are urban areas (2010 borders).