Perform ballot-polling Bayesian audits for ranked voting elections using a Dirichlet-tree prior distribution.

## Installation

#### CRAN

To install the latest minor release of elections.dtree from CRAN:

install.packages("elections.dtree")

#### GitHub

To install the development release of elections.dtree from GitHub:

# install.packages("remotes")
remotes::install_github("fleverest/elections.dtree")

#### Why?

Bayesian audits of elections typically employ a Dirichlet prior, conjugate to a multinomial distribution for the observed ballots. For ranked voting systems, the number of ballot types can grow factorially. In the case of instant-runoff voting (IRV), a popular type of ranked voting system, the number of possible ballot types is n! with n candidates (assuming all candidates are ranked; it is even greater if partially completed ballots are permitted).

The Dirichlet distribution is a simple and effective choice if the number of ballot types is small, but becomes problematic when this number gets large. As n grows, the prior concentration parameters (defined by a0 in our implementation) need to be on the order of 1 / n! to ensure the prior is not overly informative. If n is large enough, this may be smaller than the available precision. Also, the fact that this varies by n is inconvenient. A more practical alternative is given by the Dirichlet-tree distribution, which we implement in this package.

The Dirichlet-tree distribution consists of nested Dirichlet distributions, arranged hierarchically in a tree structure. The structure represents the preference ordering of the candidates. Branches coming out of each node correspond to choices for which candidate to select as the next preferred, and nodes represent a ranking of candidates (a complete ranking for leaf nodes, and an incomplete ranking for internal nodes). We place a Dirichlet distribution at each node, to model the conditional split of preferences at that node. The structure as a whole then defines a Dirichlet-tree distribution. Just like the Dirichlet, it is conjugate to a multinomial distribution. Also, the Dirichlet-tree is a generalisation, including a Dirichlet as a special case.

Using the Dirichlet-tree as a prior distribution allows it to scale efficiently to large n, and does not require setting the concentration parameters (a0) to values that depend on n to perform well. depend on n to perform well.

#### How?

In this package, the tree structure is implemented such that the nodes are only initialised when they appear in the observed ballot data (i.e. when at least one of the ballots includes a preference sequence that is represented by that node). This allows the memory-complexity to be O(n*m), where m is the number of ballots observed in the audit process.

Without such lazy evaluation, the memory-complexity is necessarily O(n!). Hence, our implementation of the Dirichlet distribution (based on a reducible Dirichlet-tree structure) enables a larger set of candidates than would be possible using traditional methods.

To sample unseen ballots without loading all nodes into memory, this repository implements a recursive strategy that generates samples starting from any uninitialized point in the tree. This works well for IRV ballot structures, since the permutation-tree structure is easily navigated given a target ballot.

Currently, only IRV elections have been implemented, but other ranked voting systems could be supported by implementing the corresponding social choice function.

In order to support different tree structures or other elections which deal with high cardinality, a similar recursive strategy for sampling should be developed (and anyone designing a new tree structure should consider this carefully).

## Usage

Let’s start off by downloading some data from PrefLib using prefio. In this example, we select the Wakehurst 2023 dataset containing the ballots cast in Wakehurst for the 2023 NSW Legislative Assembly Election. Other elections for the NSW Legislative Assembly can be found here.

wakehurst <- prefio::read_preflib("nswla/00058-00000273.soi", from_preflib = TRUE)
head(wakehurst)
##                       preferences frequencies
## 1                 [WILLIAMS Toby]       14115
## 2                 [REGAN Michael]        8499
## 3                    [WRIGHT Sue]        3179
## 4                  [HRNJAK Ethan]        1195
## 5 [WILLIAMS Toby > REGAN Michael]         976
## 6 [REGAN Michael > WILLIAMS Toby]         918

We can see that in this election, voters could specify as few as one candidate. Since the underlying structure for prefio::preferences is a ranking matrix, we can easily find the length of the longest ballot as follows:

max(wakehurst$preferences, na.rm = TRUE) ##  6 In this election, the maximum ballot length was not bounded. If we inspect the first-preferences, we can see that while "WILLIAMS Toby" receives the most first-preference ballots, "REGAN Michael" wins the election with the distributed ballots from the IRV process: wakehurst[, 1, by.rank = TRUE] ## preferences frequencies ## 1 [WILLIAMS Toby] 18940 ## 2 [REGAN Michael] 18430 ## 3 [WRIGHT Sue] 7617 ## 4 [HRNJAK Ethan] 4000 ## 5 [SORENSEN Susan] 1220 ## 6 [MAWSON Greg] 1127 social_choice( as.preferences(wakehurst), sc_function = "plurality" ) ## Warning in social_choice(as.preferences(wakehurst), sc_function = "plurality"): ## ballots contains entries with more than one preference. To compute the ## plurality social choice outcome, they will be truncated to their first ## preferences only. ##  "WILLIAMS Toby" social_choice( as.preferences(wakehurst), sc_function = "irv" ) ##$elimination_order
##  "MAWSON Greg"    "SORENSEN Susan" "HRNJAK Ethan"   "WRIGHT Sue"
##  "WILLIAMS Toby"
##
## $winners ##  "REGAN Michael" #### Specifying the prior To work with dirichlet_trees in R, there are two interfaces at your disposal. The first is a standard S3 interface which all R users will be familiar with. The second is an R6 interface which can be useful when you would like to chain commands. Other than the ability to chain commands, either interface is equivalent. Here, we initialise a new Dirichlet-tree for the Wakehurst 2023 election, requiring exactly at least one preference specified per formal ballot and using a prior a0 parameter equal to 1.5. # S3 interface dtree <- dirtree( candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)

# R6 interface (equivalent to S3)
dtree <- dirichlet_tree$new( candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)

dtree
## Dirichlet-tree (a0=1.5, min_depth=1, max_depth=5, vd=FALSE)
## Candidates: WRIGHT Sue SORENSEN Susan REGAN Michael MAWSON Greg WILLIAMS Toby HRNJAK Ethan
## Observations:
##  preferences frequencies
## <0 rows> (or 0-length row.names)

#### Observing data

The data observed during the auditing process should be formatted as a prefio::preferences object. Currently, our wakehurst object has class prefio::aggregated_preferences. We will also shuffle the ballots to imitate a real auditing scenario where we sample the ballots at random. We will only use 1000 ballots for ease of computation. To convert it to the appropriate format and shuffle we can do the following:

ballots <- sample(prefio::as.preferences(wakehurst))[1:1000]

Then to observe our first batch of say 10 ballots, we can update the model to obtain our posterior:

# S3
update(dtree, ballots[1:10])

# R6
dtree$update(ballots[1:10]) # R6 using chained commands dtree <- dirichlet_tree$new(
candidates = names(wakehurst$preferences), min_depth = 1, a0 = 1.5 )$update(ballots[1:10])

dtree
## Dirichlet-tree (a0=1.5, min_depth=1, max_depth=5, vd=FALSE)
## Candidates: WRIGHT Sue SORENSEN Susan REGAN Michael MAWSON Greg WILLIAMS Toby HRNJAK Ethan
## Observations:
##                               preferences frequencies
##                           [REGAN Michael]           3
##  [WILLIAMS Toby > WRIGHT Sue > REGAN ...]           1
##  [MAWSON Greg > WRIGHT Sue > SORENSE ...]           1
##                             [MAWSON Greg]           1
##  [REGAN Michael > WRIGHT Sue > HRNJA ...]           1
##                           [WILLIAMS Toby]           1
##  [WRIGHT Sue > REGAN Michael > HRNJA ...]           1
##                            [HRNJAK Ethan]           1

#### Bayesian inference using the posterior Dirichlet-tree

To conduct Bayesian inference using the posterior we have just obtained, we can employ monte-carlo simulation to simulate the unseen ballots and determine the election outcome many times. This can be used to estimate the probability that any particular candidate goes on to win the election under the posterior distribution.

# S3
ps <- sample_posterior(dtree, n_elections = 1000, n_ballots = length(ballots))

# R6
ps <- dtree$sample_posterior(n_elections = 1000, n_ballots = length(ballots)) # R6 using chained commands ps <- dirichlet_tree$new(
candidates = names(wakehurst$preferences), min_depth = 1, a0 = 1.5 )$update(
ballots[1:10]
)$sample_posterior( n_elections = 100, n_ballots = length(ballots) ) ps ## HRNJAK Ethan MAWSON Greg REGAN Michael SORENSEN Susan WILLIAMS Toby ## 0.09 0.17 0.51 0.00 0.13 ## WRIGHT Sue ## 0.10 We can also take one realisation from the posterior distribution to examine it directly, rather than automatically compute simulated election outcomes and aggregating the results in a single step. To simulate the unobserved N = length(ballots) - 10 ballots from the posterior predictive distribution, we can run the following command: # S3 simulated <- sample_predictive(dtree, n_ballots = length(ballots) - 10) # R6 simulated <- dtree$sample_predictive(n_ballots = length(ballots) - 10)

# R6 using chained commands
simulated <- dirichlet_tree$new( candidates = names(wakehurst$preferences),
min_depth = 1,
a0 = 1.5
)$update( ballots[1:10] )$sample_predictive(n_ballots = length(ballots) - 10)

head(simulated)
##                                preferences frequencies
## 1                          [REGAN Michael]         113
## 2                          [WILLIAMS Toby]          27
## 3                           [HRNJAK Ethan]          26
## 4                            [MAWSON Greg]          24
## 5                         [SORENSEN Susan]          20
## 6 [HRNJAK Ethan > WRIGHT Sue > MAWSON ...]          13

Then we can compute the (IRV) election outcome for this simulation (plus the observed data) explicitly like this:

social_choice(
rbind(
ballots[1:10],
prefio::as.preferences(simulated)
),
sc_function = "irv"
)
## $elimination_order ##  "SORENSEN Susan" "WILLIAMS Toby" "HRNJAK Ethan" "MAWSON Greg" ##  "WRIGHT Sue" ## ##$winners
##  "REGAN Michael"

Finally, we can reset the model to the original prior distribution:

# S3
reset(dtree)

# R6
dtree\$reset()

dtree
## Dirichlet-tree (a0=1.5, min_depth=1, max_depth=5, vd=FALSE)
## Candidates: WRIGHT Sue SORENSEN Susan REGAN Michael MAWSON Greg WILLIAMS Toby HRNJAK Ethan
## Observations:
##  preferences frequencies
## <0 rows> (or 0-length row.names)