The flows package contains functions that select flows, provide statistics on selections and propose map and graph visualisations.
The first part of the vignette recalls several methods of flow selection, the second part presents the main functions of the package and the last one proposes an example of analysis based on commuter data in the French Grand Est region.
In the field of spatial analysis, working on flows means focusing on the relationships between places rather than on their characteristics. Analysing and representing flows often requires a selection to ease interpretation.
One of the first methods developed was the so-called dominant flows (or nodal regions) method proposed by Nystuen and Dacey in 1961¹. Working on telephone flows between cities in the Seattle area, they sought to highlight the hierarchy between locations. According to this method, a place i is dominated by a place j if two conditions are met:
This method creates what graph theory calls a tree (acyclic graph) or a forest (a set of unconnected trees) with three types of nodes: dominant, dominated and intermediate. Although the method creates a clear functional hierarchy, its major drawback is that it undervalues flow intensities.
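As a rough illustration, the two criteria as they are usually stated for this method (the largest flow sent by i goes to j, and j receives more in total than i) can be sketched in base R. The toy matrix `m` and every object name below are assumptions made for the example; this is not the package's implementation.

```r
# Toy origin-destination matrix (made-up values): rows emit, columns receive
places <- c("A", "B", "C", "D")
m <- matrix(
  c( 0, 50, 10, 5,
    20,  0,  5, 2,
     8,  6,  0, 1,
     3,  9,  4, 0),
  nrow = 4, byrow = TRUE, dimnames = list(places, places)
)

# Total flow received by each place
inflow <- colSums(m)

# Criterion 1: destination of the largest flow sent by each place
main_dest <- places[apply(m, 1, which.max)]

# Criterion 2: that destination must receive more than the origin does
dominated <- setNames(as.vector(inflow[main_dest] > inflow), places)

# Places that are not dominated sit at the top of a tree
dominant <- places[!dominated]

# Dominated places that are also the main destination of another
# dominated place are intermediate nodes
intermediate <- places[dominated & places %in% main_dest[dominated]]
```

In this toy case B is the only place that is not dominated, A is intermediate (dominated by B while dominating C), and C and D are simply dominated.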
Various methods have subsequently been proposed to better reflect this intensity. One of the most frequently used is the so-called major flows method: it selects only the most important flows, in absolute or relative terms, either locally or globally. When analysing commuter data between cities, one may choose to select:
These criteria can also be expressed in relative form:
These methods often highlight hierarchies between places, but the loss of information caused by the selection is rarely questioned. Hence, it seems useful to propose statistical indicators to assess the loss of information and the characteristics of the selected flows.
flows package
A typical data workflow may be:
Flow data can be found in wide (matrix) or long format (i - j - fij, i.e. origin - destination - flow intensity). As all flows functions take flow data in wide format, the prepare_mat() function transforms a link list into a square matrix. prepare_mat() has four arguments: a data.frame to transform (x), the origin (i), the destination (j) and the flow intensity (fij).
library(flows)
# Import data
nav <- read.csv(system.file("csv/nav.csv", package = "flows"))
head(nav, 4)
#> i namei wi j namej wj fij
#> 1 1 Paris 5599722 1 Paris 5599722 1698
#> 2 1 Paris 5599722 48 Troyes 75562 4
#> 3 1 Paris 5599722 129 Sens 24625 287
#> 4 1 Paris 5599722 529 Vouziers 2120 4
# Prepare data
mat <- prepare_mat(x = nav, i = "i", j = "j", fij = "fij")
mat[1:4, 1:4]
#> 1 9 20 24
#> 1 1698 0 0 0
#> 9 0 298895 402 281
#> 20 0 264 154743 3040
#> 24 0 259 4500 129717
Four selection methods based on flow origins are accessible through the select_flows() function:
nfirst: the k first flows from all origins;
xfirst: all flows greater than a threshold k;
xsumfirst: as many flows as necessary for each origin so that their sum is at least equal to k;
dominant: flows that satisfy a dominance test; this method may be used to select flows obeying the second criterion of the Nystuen and Dacey method.
Figure 1: The three methods of the select_flows() function. Black links are the selected ones.
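To make the origin-based logic concrete, here is a minimal base R sketch of the first two ideas (the k first flows per origin, and flows above a threshold) on a made-up matrix. The object names and the handling of ties are assumptions for illustration and do not reproduce the internals of select_flows().

```r
# Toy origin-destination matrix (made-up values): rows emit, columns receive
places <- c("A", "B", "C", "D")
m <- matrix(
  c( 0, 50, 10, 5,
    20,  0,  5, 2,
     8,  6,  0, 1,
     3,  9,  4, 0),
  nrow = 4, byrow = TRUE, dimnames = list(places, places)
)

# "k first flows from each origin": rank the flows within each row
k_first <- 1
sel_nfirst <- t(apply(m, 1, function(row) {
  as.integer(row > 0 & rank(-row, ties.method = "min") <= k_first)
}))
dimnames(sel_nfirst) <- dimnames(m)

# "All flows greater than a threshold": a plain element-wise comparison
k_min <- 8
sel_xfirst <- (m > k_min) * 1L
```

Both results are binary matrices with the same dimensions as `m`, so `m * sel_nfirst` gives back the intensities of the retained flows.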
Methods taking into account the total volume of flows are applied when the global = TRUE parameter is used. They are identical to the ones described above: selection of the k first flows, selection of flows greater than k and selection of flows such that their sum is at least equal to k.
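The difference between an origin-based and a global selection can be sketched in base R by ranking flows over the whole matrix instead of within each row. The toy matrix and names are assumptions for illustration, not the package's code.

```r
# Toy origin-destination matrix (made-up values): rows emit, columns receive
places <- c("A", "B", "C", "D")
m <- matrix(
  c( 0, 50, 10, 5,
    20,  0,  5, 2,
     8,  6,  0, 1,
     3,  9,  4, 0),
  nrow = 4, byrow = TRUE, dimnames = list(places, places)
)

# Keep the k largest flows of the whole matrix, whatever their origin
k_first <- 2
sel_global <- matrix(
  as.integer(m > 0 & rank(-m, ties.method = "min") <= k_first),
  nrow = nrow(m), dimnames = dimnames(m)
)

# Number of flows kept per origin: some origins keep none at all
rowSums(sel_global)
```

Here only the flows A to B and B to A survive, so origins C and D disappear from the selection, which would not happen with an origin-based rule.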
All these functions take as input a square matrix of flows and generate binary matrices of the same size. Selected flows are coded 1, others 0. It is therefore possible to combine criteria of selection through element-wise multiplication of matrices (Figure 2).
Figure 2: Flow selection and criteria combination
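A minimal base R sketch of such a combination, assuming two hypothetical criteria (an absolute threshold and a share of each origin's outgoing flows) on a made-up matrix:

```r
# Toy origin-destination matrix (made-up values): rows emit, columns receive
places <- c("A", "B", "C", "D")
m <- matrix(
  c( 0, 50, 10, 5,
    20,  0,  5, 2,
     8,  6,  0, 1,
     3,  9,  4, 0),
  nrow = 4, byrow = TRUE, dimnames = list(places, places)
)

# Criterion 1: flows strictly greater than 5
sel_abs <- (m > 5) * 1L

# Criterion 2: flows above 20% of the origin's total outgoing flow
# (dividing a matrix by a vector of row sums rescales each row)
sel_rel <- (m / rowSums(m) > 0.2) * 1L

# A flow is kept only if both binary matrices hold a 1
sel_both <- sel_abs * sel_rel

# Multiplying by the original matrix restores the intensities
kept <- m * sel_both
```

Because each selection is coded 0/1, the product acts as a logical AND; `pmax(sel_abs, sel_rel)` would give the corresponding OR.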
The stat_mat() function provides various indicators and graphical outputs on a flow matrix to allow statistically sound selections. Measures provided are density (number of present flows divided by the number of possible flows); number, size and composition of connected components; sum, quartiles and average intensity of flows. In addition, four graphics can be plotted: degree distribution curve (by default, outdegree), weighted degree distribution curve, Lorenz curve and boxplot on flow intensities.
# Get statistics about the matrix
stat_mat(mat = mat, output = "none", verbose = TRUE)
#> matrix dimension: 159 X 159
#> nb. links: 3350
#> density: 0.1333493
#> nb. of components (weak) 1
#> nb. of components (weak, size > 1) 1
#> sum of flows: 2306577
#> min: 1
#> Q1: 4
#> median: 10
#> Q3: 55
#> max: 298895
#> mean: 688.5304
#> sd: 7765.106
# Plot Lorenz curve only
stat_mat(mat = mat, output = "lorenz", verbose = FALSE)
# Statistics only
mystats <- stat_mat(mat = mat, output = "none", verbose = FALSE)
str(mystats)
#> List of 16
#> $ matdim : int [1:2] 159 159
#> $ nblinks : num 3350
#> $ density : num 0.133
#> $ connectcomp : num 1
#> $ connectcompx: int 1
#> $ sizecomp :'data.frame': 1 obs. of 3 variables:
#> ..$ idcomp : int 1
#> ..$ sizecomp: num 159
#> ..$ wcomp : num 2306577
#> $ compocomp :'data.frame': 159 obs. of 2 variables:
#> ..$ id : chr [1:159] "1" "9" "20" "24" ...
#> ..$ idcomp: num [1:159] 1 1 1 1 1 1 1 1 1 1 ...
#> $ degrees :'data.frame': 159 obs. of 3 variables:
#> ..$ id : chr [1:159] "1" "9" "20" "24" ...
#> ..$ degree : num [1:159] 7 89 78 76 87 61 65 55 44 49 ...
#> ..$ wdegree: num [1:159] 2021 318299 170691 148765 157821 ...
#> $ sumflows : num 2306577
#> $ min : num 1
#> $ Q1 : num 4
#> $ median : num 10
#> $ Q3 : num 55
#> $ max : num 298895
#> $ mean : num 689
#> $ sd : num 7765
# Sum of flows
mystats$sumflows
#> [1] 2306577
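The figures printed above can be checked by hand. They appear consistent with a density computed over n(n - 1) possible links and a mean taken over non-zero flows only; this is an inference from the printed output, not a statement about the package internals.

```r
# Values copied from the stat_mat() output above
n <- 159
nb_links <- 3350
sum_flows <- 2306577

# Density: present links over possible links between distinct places
density <- nb_links / (n * (n - 1))

# Mean intensity of the flows that are present
mean_flow <- sum_flows / nb_links

round(density, 7)   # 0.1333493
round(mean_flow, 4) # 688.5304
```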
To ease comparisons, the compare_mat() function returns a data.frame that provides statistics on differences between two matrices (for example a matrix and a selection of this matrix).
Visualisation helps analysis: the plot_nodal_flows() function produces a graph where the sizes and colors of vertices depend on their position in the graph (dominant, intermediate or dominated) and link widths depend on flow intensities.
The map_nodal_flows() function maps the selected flows according to the same principles.
Both functions only apply to a dominant flows selection².
As an illustration, we present a brief analysis of commuter flows between urban areas of the Grand Est region in France³.
We compare two different thresholds (500 and 1000) on the total volume of flows.
# Remove the matrix diagonal
diag(mat) <- 0
# Selection of flows > 500
mat_sel_1 <- select_flows(mat = mat, method = "xfirst", k = 500, global = TRUE)
# Selection of flows > 1000
mat_sel_2 <- select_flows(mat = mat, method = "xfirst", k = 1000, global = TRUE)
# Compare initial matrix and selected matrices
compare_mat(mat1 = mat, mat2 = mat * mat_sel_1, digits = 0)
#> mat1 mat2 absdiff reldiff
#> nblinks 3191 137 3054 96
#> sumflows 313292 193203 120089 38
#> connectcompx 1 10 9 NA
#> min 1 502 NA NA
#> Q1 4 584 NA NA
#> median 8 880 NA NA
#> Q3 41 1702 NA NA
#> max 8654 8654 NA NA
#> mean 98 1410 NA NA
#> sd 400 1343 NA NA
compare_mat(mat1 = mat, mat2 = mat * mat_sel_2, digits = 0)
#> mat1 mat2 absdiff reldiff
#> nblinks 3191 62 3129 98
#> sumflows 313292 145368 167924 54
#> connectcompx 1 7 6 NA
#> min 1 1021 NA NA
#> Q1 4 1253 NA NA
#> median 8 1792 NA NA
#> Q3 41 2938 NA NA
#> max 8654 8654 NA NA
#> mean 98 2345 NA NA
#> sd 400 1543 NA NA
If we select flows greater than 500 commuters, we lose 96% of all links but only 38% of the volume of flows. With a threshold of 1000 commuters, 98% of links are lost, along with 54% of the volume of flows.
The following example selects flows that represent at least 20% of the sum of outgoing flows for each urban area.
# Percentage of each outgoing flows
mat_p <- mat / rowSums(mat) * 100
# Select flows that represent at least 20% of the sum of outgoing flows for
# each urban area.
mat_p_sel <- select_flows(mat = mat_p, method = "xfirst", k = 20)
# Compare initial and selected matrices
compare_mat(mat1 = mat, mat2 = mat * mat_p_sel, digits = 2)
#> mat1 mat2 absdiff reldiff
#> nblinks 3191.00 240.00 2951 92.48
#> sumflows 313292.00 167088.00 146204 46.67
#> connectcompx 1.00 6.00 5 NA
#> min 1.00 3.00 NA NA
#> Q1 4.00 156.00 NA NA
#> median 8.00 323.00 NA NA
#> Q3 41.00 584.50 NA NA
#> max 8654.00 8654.00 NA NA
#> mean 98.18 696.20 NA NA
#> sd 399.87 1147.75 NA NA
This selection keeps only 8% of all links and 53% of the flow volume.
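The kept shares follow directly from the comparison table, whose reldiff column reports what is lost. A short check, using only numbers copied from the output above (the variable names are ours):

```r
# Values copied from the compare_mat() output above
links_all <- 3191
links_sel <- 240
vol_all <- 313292
vol_sel <- 167088

# Shares kept by the selection, in percent
kept_links <- 100 * links_sel / links_all
kept_volume <- 100 * vol_sel / vol_all

# Shares lost, which should match the reldiff column
lost_links <- 100 - kept_links
lost_volume <- 100 - kept_volume

round(c(kept_links, kept_volume, lost_links, lost_volume), 2)
# 7.52 53.33 92.48 46.67
```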
We decide to run a dominant flow analysis on this dataset.
nodal_flows() combines the two criteria in a single function and returns a flow matrix.
res <- nodal_flows(mat)
compare_mat(mat1 = mat, mat2 = res)
#> mat1 mat2 absdiff reldiff
#> nblinks 3191 134 3057 96
#> sumflows 313292 100781 212511 68
#> connectcompx 1 25 24 NA
#> min 1 4 NA NA
#> Q1 4 188 NA NA
#> median 8 382 NA NA
#> Q3 41 588 NA NA
#> max 8654 8654 NA NA
#> mean 98 752 NA NA
#> sd 400 1192 NA NA
This analysis keeps 4% of all links and 32% of the flow volume.
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(mapsf)
UA <- st_read(system.file("gpkg/GE.gpkg", package = "flows"),
  layer = "urban_area", quiet = TRUE
)
GE <- st_read(system.file("gpkg/GE.gpkg", package = "flows"),
  layer = "region", quiet = TRUE
)
mf_map(GE, col = "#c6deba", border = NA, expandBB = c(0, 0, 0, .25))
out <- map_nodal_flows(
  mat = mat, x = UA,
  col_node = c("red", "orange", "yellow"),
  col_flow = "grey30",
  leg_pos_node = "topright",
  leg_pos_flow = "right",
  leg_flow = "Nb. of commuters",
  breaks = c(4, 100, 1000, 2500, 8655),
  lwd = c(1, 4, 8, 16), add = TRUE
)
mf_label(out$nodes[out$nodes$w > 6000, ],
  var = "name",
  halo = TRUE, overlap = FALSE
)
mf_title("Dominant Flows of Commuters")
mf_credits("INSEE, 2011")
mf_scale()
head(out$nodes[order(out$nodes$w, decreasing = TRUE), 2:3, drop = TRUE])
#> name w
#> 3 Nancy 18119
#> 2 Strasbourg 18057
#> 4 Metz 17927
#> 7 Mulhouse 14577
#> 13 Colmar 9666
#> 15 Belfort 8785
The top of the node hierarchy clearly shows, in descending order, the domination of Nancy, Strasbourg, Metz and Mulhouse, each attracting more than 10 000 commuters.
flows aims to enable relevant flow selections, while leaving maximum flexibility to the user.
¹ J. Nystuen & M. Dacey, 1961, “A Graph Theory Interpretation of Nodal Regions”, Papers and Proceedings of the Regional Science Association, 7:29-42.
² Viewing options are only dedicated to the nodal regions / dominant flows method since other R packages exist to ensure graph or map representations.
³ Data comes from the 2011 French National Census (Recensement Général de la Population de l’INSEE). The area includes five administrative regions: Champagne-Ardenne, Lorraine, Alsace, Bourgogne, and Franche-Comté. Cities are urban areas (2010 borders).