Articles

Where you do not want to read your work

Where you do not want to read your work

My first motivation in my work is to do good, meaningful science. Good science means appropriate statistics, reproducible workflows, and data that are as open as possible.

My secondary motivation is to never have my work torn apart on Andrew Gelman’s Blog, TWIV, or DataMethods. Of course, discussion of published literature and its limitations is great and needs to happen, but I want my work to be the best possible version of itself.

Read more…

Writing 2023

Writing in 2023

Happy New Year! I hope to be engaged a little more in blogging this year, especially as Twitter seems to be imploding. Someone once told me it is good luck to do a few of the things you want to do on New Year's Day in order to set the tone for a successful year. This normally means reading, writing, thinking, and spending time with family. So I’ll try to start this process right.

Read more…

Merry Christmas

Merry Christmas

Merry Christmas, readers! For those who celebrate, I hope you have a great day of family and fellowship, if that is your thing. If you don’t, I hope that you, too, have a great day of family and fellowship.

The number of COVID-19 cases in China is absolutely staggering, and my heart goes out to them. To think of a million cases a day is unfathomable. Further, the risk posed to the world by letting a virus run rampant in a population is frightening.

Regardless, I need to get back into writing here more frequently.

Read more…

Zombies and Viruses

Zombies and Viruses

The other day I finished watching the National Geographic series “Hot Zone,” based on the book by Richard Preston. The whole story is centered primarily on the Reston Virus, an Ebolavirus from the same genus as the Zaire and Sudan Ebola viruses, which have ravaged humans. Unlike those other viruses, Reston Virus does not appear to be pathogenic to humans, but it is very much so to non-human primates. This is despite the viruses not being that dissimilar, which is cause for concern.

Read more…

Designing Public Health Surveillance

Designing Public Health Surveillance

@seabbs started a very thought-provoking thread on Mastodon the other day. It only came to my attention today (because I haven’t yet figured out the whole timeline thing). The entire contents of the post are located here, in which he asks what the role of academics should be in designing public health surveillance and outbreak response systems. He asks all the right questions along the lines of: are academics (including PhD students, post-docs, early career researchers, etc.) the right people to be designing these systems and/or the tools to respond to them? Are the incentives and expertise aligned (e.g., who funds whom and how are careers made)? It is a real problem and an open question for public health and academia how to organize around this topic. It is further compounded by the fact that there are other partners in this dance. There are normal government funding mechanisms, extra/intra-mural funders, NGOs (e.g., Wellcome, Gates, GAVI, CEPI, among others), and the brass tacks of promotion, compensation, and ego.

Read more…

Suitability of Engineering to Bayes

Suitability of Engineering to Bayes

I was listening to the Learning Bayesian Statistics podcast episode in which the host interviews Justin Bois of Caltech. Listening to how Justin came into statistics from chemical engineering (a story similar to my own), I was struck by how natural the transition from engineering to Bayes is…. My first serious introduction to applied statistics was through projects designed to identify process variation. The experienced engineers all taught me that variation is natural, something you want to reduce in manufacturing, but never something that could be eliminated. That comfort with not having a fixed number led to a higher comfort with probabilistic reasoning.

More to write on this later.

Read more…

Moving to mastodon

Moving to mastodon

So Elon bought Twitter. I have always been a bit ambivalent about using platforms on which I am the product. In some cases the ends justify the means, but someone who outwardly endorses things I am vehemently against is something up with which I will not put. I’m going to spend more time blogging, coding, and tooting1 at @medewitt@mastodon.social.

  1. yeah I know that will take some getting used to.

Read more…

Git Submodules

Git Submodules

Not too much to say on this topic other than I think that it is my new favorite tool.

Submodules allow you to better encapsulate code by placing subdirectories in their own repositories. These can then be added using:

git submodule add URL

Then you can be off to the races. This also solves some of the problems with trying to deploy websites at a subdomain using a third-party service.

Read more…

Editing PHP

Editing PHP

A long time ago now, someone I deeply respect told me I ought to learn some PHP. I didn’t. However, this website’s rendering engine is written in pretty simple (but simple doesn’t mean inelegant) PHP code. The way the initial code is written, if a given post’s file name contains more than just a date (e.g., YYYY-MM-DD-something.md), the current code strips the file extension but leaves the remaining letters. When PHP tries to convert these to a date it fails and returns the beginning of PHP time (Jan 1, 1970). I ended up fixing the code for my use by just taking the first ten characters as a substring, and all is well. That, however, is the point of this post. Once you have fluency in one or more programming languages, especially general purpose ones, you can move in different waters. You at least have an idea of what the idiom is and what to look for, and often that makes all of the difference.

Read more…

tikz blobs

drawing blobs

This looks really neat! Seems like TikZ really is the language of the realm when you need to make these kinds of diagrams:

https://tex.stackexchange.com/questions/145269/drawing-blobs

Read more…

Helper functions needed in Julia

Helper functions needed for Julia

There are some really nice packages in R that need Julia ports. I really have been enjoying using Julia but it reminds me of Base R when I first started using it (e.g., pre-tidyverse). The easystats suite looks really nice and ripe for a port:

Read more…

Justfiles are cool

Justfiles are cool

I have to say that justfiles might be my new favourite multi-lingual build tool. They allow you to use multiple languages within recipes, support variables, are less sensitive to tabs, and have some really nice documentation.

Read more…

Forever Chemicals in the Water

Forever Chemicals

The EPA recently issued updated guidance on acceptable levels in drinking water of two so-called forever chemicals: perfluorooctanoic acid (PFOA) and perfluorinated alkylated substances (PFAS). These substances are used in non-stick applications and have become pervasive in everyday life. Unfortunately, they are extraordinarily stable, don’t easily degrade in nature, are not captured by waste-water processing, and have been found for years in human serum. The EPA has been slowly lowering the acceptable levels in drinking water as the science evolves. As is the case with many long-term environmental studies, it takes a long time to gather observational data (with unknown effect sizes), but, unsurprising to anyone, they have found that even a little of these chemicals could have negative effects on human health.

Read more…

Moving to quarto

tl;dr: porting to Quarto was not terrible. Also, moving from GitHub Pages to Netlify at the same time was a bit much.

Read more…

Pfizer BNT162b2 for Under 5s

As the parent of a child under the age of five, I have been anxiously awaiting news regarding the availability of a vaccine against COVID-19 for some time. The news recently was that the FDA was encouraging Pfizer/BioNTech to submit the clinical trial data for their vaccine candidate for the six-month to five-year-old group. Early reporting was that the vaccine wasn’t meeting its endpoints with two doses (which were reduced in concentration from the older groups for fear of vaccine-induced myocarditis) and would need to pursue a third dose.

Read more…

Waste Water Monitoring and COVID-19

NCDHHS has been participating in waste-water surveillance for SARS-CoV-2. Unfortunately, there is some delay in the data (posted weekly and what is posted is generally a week old or more).

Read more…

An Engineering Mindset to Public Health

Engineering and Public Health

The Centers for Disease Control and Prevention updated their COVID-19 guidelines for isolation and quarantine (remember: isolate if you are positive, quarantine if you are a contact) on December 27, 2021 and received some strong backlash. The updates were certainly needed, as the prior guidelines were last updated around December 2020, when a vaccine was not available and there were many additional unknowns. In the meantime, the United States is in the midst of a Delta wave of infections and a growing presence of the Omicron variant, the latter appearing to be more fit and better at escaping vaccine- or infection-derived immunity. This means that many people are in quarantine and isolation at all levels, which is causing issues for basic services (e.g., hospitals down critical staff, flights canceled due to crews in quarantine/isolation, day care centers, schools, other commerce activities, and even college football bowl games). While there is general agreement that the guidelines needed to be updated based on the availability of vaccines and boosters, there is some disagreement on the language and the lack of a testing component.

Read more…

Can Bayesian Analysis Save Paxlovid?

Paxlovid

Paxlovid is a new antiviral (protease inhibitor) that, when paired with ritonavir, provides excellent protection against hospitalization for patients infected with SARS-CoV-2. Pfizer, the manufacturer, had two endpoints: a primary endpoint for “High Risk” patients and a secondary endpoint for “Standard Risk” patients. I thought it was neat that BioNTech/Pfizer used a Bayesian analysis for their BNT162b2 vaccine, so I was a little surprised when I saw a frequentist p-value for their secondary endpoint:

Read more…

Neutralizing Antibody Titres

Reading the Pre-Prints

A prominent scientist has been posting his highlights of pre-prints on Twitter during the course of the pandemic, and occasionally I come across one that makes me scratch my head. I think that the move to open and rapid science through pre-prints has been good, but not without issues. Review (and open review especially) makes work better and is an important part of science and progress.

Read more…

Thinking About Viral Evolution

I have been re-reading Anderson and May. Each time I come back to the book I find gems that are highly salient to what’s going on in the world today. The gem I read yesterday was about viral evolution and virulence. The question was about pathogens becoming more virulent over time vs less virulent, and the role this plays in transmission. They set it out in the following familiar equation for the basic reproduction number:
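As a sketch from memory of the standard Anderson and May form (the post’s exact notation may differ):

$$ R_0 = \frac{\beta N}{\alpha + b + \nu} $$

where $\beta$ is the transmission coefficient, $N$ the host population size, $\alpha$ the disease-induced mortality (virulence), $b$ the background mortality rate, and $\nu$ the recovery rate. The tension is that changes which increase virulence $\alpha$ may also increase transmission $\beta$, so selection does not automatically push pathogens toward mildness.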

Read more…

RProfiles


In the R programming environment, you have access to configuration files, typically your .Rprofile and your .Renviron. RStudio has a more thorough write-up available at this link. On the start of a new R session, your environment file’s contents are available through a Sys.getenv call. Rprofiles are basically “sourced” into the active R session on start, and everything in the Rprofile is available within the session. Generally speaking, your environment file is for definitions (like a typical dotfile) in which variables are given values in the form of:
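A quick sketch of what that looks like (the values here are made up):

```r
# .Renviron: plain KEY=value pairs, one per line, read at session start
#   GITHUB_PAT=ghp_xxxxxxxx
#   MY_API_URL=https://example.com/api

# retrieve one in the live session
Sys.getenv("GITHUB_PAT")

# .Rprofile, by contrast, is ordinary R code sourced at startup, e.g.
# options(repos = c(CRAN = "https://cloud.r-project.org"))
```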

Read more…

How Discerning is the Technical Challenge in GBBO?

My wife and I love to watch the Great British Bake Off on Netflix. The competition is for the most part collegial and all around feel-good television, especially at night. After watching several seasons of the show, a lingering question came to mind: how good are the judges at estimating talent?

Read more…

Spike

I read Jeremy Farrar’s Spike over the weekend. I don’t have anything insightful to say about it. From this side of the pond, the United Kingdom seemed like it had the best shot at handling this pandemic. With the way that scientists and scientific advisors are brought in to consult and model in a (from the outside) organized fashion as part of the decision-making process, plus the venerable National Public Health Service, they had all the pieces to take informed decisions. In the States, sure, we have several regulatory bodies and public health agencies, and a group of rag-tag academics that seem to just publish and post on Twitter. Nothing coordinated. Additionally, our health systems are disparate: some state-run, some private, some for-profit, others not. So I was interested to read an insider’s take on how the UK could fail so miserably.

Read more…

Comparing Mortality Rates is Hard

A Simple Question

Say you get a simple question like: how does the mortality rate from COVID-19 for our county (if you live in the United States and not Louisiana) compare to other counties in the state? Seems like a straightforward question, right? Well, not really. The first question is: a rate compared to what?

Read more…

Time to Vaccinate

This post is completely inspired by @mjskay’s very interesting analysis/critique of a New York Times linear extrapolation of when the United States would reach the critical proportion for vaccination. The New York Times has made some amazing graphics during the pandemic, but their analyses have been pretty spotty. Between this linear extrapolation of vaccination rates and their SIRs for third and fourth waves, they really need to put some additional thought into the more math-heavy work.

Read more…

Skin in the Game

Nassim Nicholas Taleb is a rather polarizing figure, but one point he makes over and over again is that it is better to trust someone who has “skin in the game.” If you have money on the table or stakes in the outcome, you typically sharpen your pencil and try to maximize your chance of winning (find an edge or fold, so to speak) in a way you don’t when nothing is at stake.

Read more…

Negative Binomial Distribution and Epidemics

This is partially just for me to remember how to parameterise a Negative Binomial distribution. Typically, the Negative Binomial is used for the number of Bernoulli trials before the first success (or, more generally, a given number of successes) is achieved, and is thus a discrete distribution. For some probability of success, $p$, given $r$ failures and $k$ successes, we can define the probability mass function as:

$$ \binom{k + r - 1}{k} (1-p)^r p^k $$
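As a quick R reminder (my own sketch, with made-up parameter values), the two parameterisations R exposes are the size/prob form and the mean/dispersion form that turns up in epidemic models for overdispersed secondary case counts:

```r
# dnbinom(x, size, prob): number of failures x before the size-th success
x <- 0:10
dnbinom(x, size = 3, prob = 0.4)

# dnbinom(x, size, mu): mean/dispersion form, common for superspreading
dnbinom(x, size = 0.1, mu = 2.5)
```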

Read more…

Bayesian SIR

<script src="2020-08-28-bayesian-sir_files/libs/htmlwidgets-1.6.2/htmlwidgets.js"></script> <script src="2020-08-28-bayesian-sir_files/libs/viz-1.8.2/viz.js"></script> <script src="2020-08-28-bayesian-sir_files/libs/grViz-binding-1.0.10/grViz.js"></script>
library(cmdstanr)
library(magrittr)
library(ggplot2)
library(data.table)
library(deSolve)

Compartment models are commonly used in epidemiology to model epidemics. Compartmental models are composed of differential equations and capture some “knowns” regarding disease transmission. Because these models seek to simulate/model the epidemic process directly, they are somewhat more resistant to some biases (e.g., missing data). Strong-ish assumptions must be made regarding disease transmission, and varying levels of detail can be included in order to make the models closer to reality.
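As a small sketch of the idea (parameter values here are made up, not from the post), a basic SIR model solved with deSolve looks like:

```r
library(deSolve)

# classic SIR: susceptible -> infectious -> recovered
sir <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

out <- ode(y = c(S = 9990, I = 10, R = 0),
           times = seq(0, 180, by = 1),
           func = sir,
           parms = c(beta = 0.3, gamma = 0.1, N = 10000))
head(as.data.frame(out))
```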

Read more…

Optimisation with Stan

In this post I just wanted to play around using the ability to optimize functions in Stan. I know that there are other solvers available, but I just wanted to try this one to start (and I can’t find it now, but I think this was discussed in Regression and Other Stories).
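A minimal sketch of what I mean (a toy model of my own, not from the book): Stan can return a penalized maximum likelihood / posterior mode via optimizing().

```r
library(rstan)

# toy model: find the mode for the mean and sd of a normal sample
model_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"

fit <- optimizing(stan_model(model_code = model_code),
                  data = list(N = 50, y = rnorm(50, 2, 1)))
fit$par
```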

Read more…

ggdist and Epidemic Curves

library(ggplot2)
library(dplyr)
library(tidyr)
library(ggdist)
library(odin)
theme_set(theme_minimal())

This little post is based on a recent paper by Juul et al. (2020). For infectious disease modelling we are often running many scenarios through the models. Why? Because there are many parameters to estimate but very few things that we actually observe (cases by day, or incidence), so in some sense our models all suffer from identifiability issues. Additionally, many of the parameters that we do measure have some uncertainty around them. For this reason it is very common to run models with different parameter estimates or distributions in order to understand the range of possible values. A simple way to summarise these models is through point-wise quantiles through time. The insight of Juul’s paper is that this can underestimate many outcomes, because a point-wise summary might pick up the beginning of one epidemic curve and miss the peak of another.
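A small simulation sketch of the problem (my own toy curves, not the paper’s): shift the peaks of otherwise identical epidemic curves and the point-wise median is flatter and lower than any single trajectory.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(123)
curves <- expand_grid(sim = 1:200, t = 0:100) %>%
  group_by(sim) %>%
  mutate(peak = sample(30:70, 1),
         incidence = 1000 * exp(-(t - peak)^2 / 150)) %>%
  ungroup()

pointwise <- curves %>%
  group_by(t) %>%
  summarise(lower = quantile(incidence, 0.025),
            median = median(incidence),
            upper = quantile(incidence, 0.975))

ggplot() +
  geom_line(data = curves, aes(t, incidence, group = sim), alpha = 0.05) +
  geom_ribbon(data = pointwise, aes(t, ymin = lower, ymax = upper),
              fill = "steelblue", alpha = 0.3) +
  geom_line(data = pointwise, aes(t, median), colour = "steelblue")
```

If I recall correctly, ggdist’s curve_interval() implements a curve-wise alternative along these lines.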

Read more…

julia ABM SIR

In this post I just wanted to play with Julia and R together on the topic of agent-based models. Agent-based models have the particular advantage that they capture the random effects of individuals in the model (and their shifting states). The downside of agent-based models is, of course, that you have to program all of those transitions in your model. Regardless, they do show a much wider range of values and allow for very fine-tuned interactions and effects that you might miss with a deterministic model. I’m also using this as a way of exploring Julia and the agents package, which is built for this purpose.
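As a tiny sketch of the idea on the R side (made-up parameters, and far simpler than what the agents package supports): track each individual’s state and draw its daily transitions at random.

```r
set.seed(42)
n <- 1000
state <- c(rep("I", 10), rep("S", n - 10))
beta  <- 0.3   # transmission rate per infectious contact per day
gamma <- 0.1   # daily probability an infectious agent recovers

results <- data.frame(day = 0, S = sum(state == "S"),
                      I = sum(state == "I"), R = sum(state == "R"))

for (day in 1:100) {
  p_infect  <- 1 - exp(-beta * sum(state == "I") / n)
  infected  <- state == "S" & runif(n) < p_infect
  recovered <- state == "I" & runif(n) < gamma
  state[infected]  <- "I"
  state[recovered] <- "R"
  results <- rbind(results,
                   data.frame(day = day, S = sum(state == "S"),
                              I = sum(state == "I"), R = sum(state == "R")))
}

tail(results)
```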

Read more…

Sensitivity and Specificity

One important concept that is discussed often is the role of sensitivity and specificity in testing. Here we are looking at the ability of a given diagnostic test to correctly identify true positives (sensitivity) and to correctly identify true negatives (specificity). If you always forget which one is sensitivity and which one is specificity, no worries, I feel like everyone googles it.
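A quick sketch of the calculations (the counts here are made up):

```r
# rows: true status, columns: test result
confusion <- matrix(c(90, 10,     # truly positive: 90 test +, 10 test -
                      45, 855),   # truly negative: 45 test +, 855 test -
                    nrow = 2, byrow = TRUE,
                    dimnames = list(truth = c("positive", "negative"),
                                    test  = c("positive", "negative")))

sensitivity <- confusion["positive", "positive"] / sum(confusion["positive", ])
specificity <- confusion["negative", "negative"] / sum(confusion["negative", ])

# the positive predictive value also depends on prevalence, not just the test
prevalence <- 0.01
ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

c(sensitivity = sensitivity, specificity = specificity, ppv = ppv)
```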

Read more…

Flatten the Curve

ggplot2::theme_set(ggplot2::theme_minimal())

Introduction

I just wanted to write up a quick exponential growth model for my county. I have been saddened that the local department of public health has been underestimating the power of exponential growth.
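A minimal sketch of the kind of model I mean (the starting count and doubling time are placeholders, not my county’s numbers):

```r
library(ggplot2)

cases0 <- 10
doubling_time <- 5   # days
days <- 0:60

growth <- data.frame(day = days,
                     cases = cases0 * 2^(days / doubling_time))

ggplot(growth, aes(day, cases)) +
  geom_line() +
  labs(title = "Unchecked exponential growth",
       subtitle = "10 initial cases with a 5-day doubling time",
       x = "Day", y = "Cases")
```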

Read more…

Airflow on Windows Linux Subsystem

Why Do This?

Like it or not, there are a lot of Windows shops out there. However, many super useful tools are Linux-native and don’t really bridge into the Windows world (for good reason). That being said, Windows and Microsoft are (I think) acknowledging this gap and have provided the Windows Subsystem for Linux as a bridge between these two dominant platforms. This allows us little people to take advantage of some of the high-powered tools available.

Read more…

2020 Plans

I’m not one for New Year’s resolutions, but this post is more about those things that I wish to accomplish in this coming year. Someone once told me that you should do a little of whatever you want to do in the coming year on New Year’s day (e.g. if you want to read more, read a bit; if you want to draw, sketch a little something). I would like to blog more, so this should start the year off right.

Read more…

How About Impeachment?

So I suppose this is a logical follow-up to the previous post. Now, instead of approval, we can look at impeachment.

Read more…

Approval Rating Now?

This one is a quick one and based on some work that I have already done. Given the ongoing controversy about President Trump potentially using publicly held funds to strong-arm a foreign entity for personal and political gain, I figured it was time to do some state-space modeling on approval polls. This will be a quick one just because I want an answer.

Read more…

Integrating Over Your Loss Function

library(tidyverse)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(theme_minimal())

Always Be Integrating Over Your Loss Function

This is a line that I heard in a talk that James Savage gave at an NYC R Conference, in which he discussed the importance of integrating over your loss function. I think that this is an infinitely important topic. Statisticians, myself included, often marvel over the most parsimonious model with the highest explained variance or best predictive power. Additionally, when it comes to assessment and verifying the impact of an intervention, ensuring that one has accounted for as many of the confounding factors as possible is important for uncovering the LATE or ATE (depending on the context).
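A small sketch of what that looks like in practice (the draws and loss functions are made up): rather than acting on a point estimate, average the loss of each decision over the posterior.

```r
set.seed(1)
# stand-in for MCMC draws of an intervention's effect
posterior_effect <- rnorm(4000, mean = 0.5, sd = 0.4)

# asymmetric losses for acting vs not acting, as a function of the true effect
loss_act      <- function(effect) ifelse(effect > 0, 0, -10 * effect)
loss_dont_act <- function(effect) ifelse(effect > 0, 2 * effect, 0)

# Monte Carlo integration of each loss over the posterior
c(act      = mean(loss_act(posterior_effect)),
  dont_act = mean(loss_dont_act(posterior_effect)))
```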

Read more…

ShinyProxy Serving Websites

Preview Image from shinyproxy.io

One of the neat things about ShinyProxy is that it allows you to package your Shiny applications into Docker containers and then serve them up via centralised “App Store” like front-end with all the goodies you might want (authentication, logging, etc). You get the bonus of building these containers in a container (so some dependency management) and scaling of containers with whatever backend you need.

Read more…

Remembering Apollo

In Memory of Apollo

With all of the media coverage, I have grown very nostalgic regarding the Apollo space program. It always amazes me to read the stories about the science and engineering that were sparked off by the space race.

Read more…

On the use of command line tools

What always amazes me is that years ago, when memory was scarce and computational power was expensive, tools were developed to parse and manipulate data within those restrictions. Among these tools you find things like AWK, sed, and the Bourne shell. I have begun to appreciate these tools for facilitating data work.

Read more…

Defining a Project Workflow

library(data.tree)
acme <- Node$new(name = "data analysis project")
  data <- acme$AddChild("data")
  data_raw <- acme$AddChild("data-raw")
  libs <- acme$AddChild("libs")
  munge <- acme$AddChild("munge")
  src <- acme$AddChild("src")
  output <- acme$AddChild("outputs")
  reports <- acme$AddChild("reports")
  readme <- acme$AddChild("README.Rmd")
  makefile <- acme$AddChild("makefile")
  rproject <- acme$AddChild("project.Rproj")

A motivating example

A consistent workflow is very powerful for several key reasons:

  • Removes the mental burden of structuring a project
  • Makes sharing easier by establishing a common understanding of what does what and definitions (e.g. certain files and certain directories always do certain things)
  • Makes project hand-offs easier (e.g. no one stays in the same job forever, so it is nice when things can transition between owners more easily)

Read more…

Finding the Needle in the Haystack

library(tidyverse)

One of the challenges in any kind of prediction problem is understanding the impact of (a) not identifying the target and (b) falsely identifying the target. To put it into context, if you are trying to use an algorithm to identify those with Ebola, what’s the risk of missing someone (they could infect others with the disease) vs identifying someone who does not have the disease as having it (they get quarantined and have their life disrupted)? Which is worse? That isn’t a statistical question; it is a contextual, even an ethical, question (ok, yes, you could also apply a loss function here and use that to find a global minimum value should it exist, but then again that is a choice).

Read more…

State Space Models for Poll Prediction

Motivating Example

I have always been interested in state space modeling. It is really interesting to see how this modeling strategy works in the realm of opinion polling. Luckily I stumbled across an example that James Savage put together for a workshop series on Econometrics in Stan. Additionally, while I was writing this blog post by happenstance Peter Ellis put out a similar state space Bayesian model for the most recent Australian elections. His forecasts were by far the most accurate out there and predicted the actual results. I wanted to borrow and extend from his work as well.

Collect the Data

This is the original data collection routine from James Savage’s work.

library(tidyverse)
library(rvest)
library(rstan)
library(lubridate)

options(mc.cores = parallel::detectCores())

# The polling data
realclearpolitics_all <- read_html("http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html#polls")
# Scrape the data
polls <- realclearpolitics_all %>%
  html_node(xpath = '//*[@id="polling-data-full"]/table') %>%
  html_table() %>%
  filter(Poll != "RCP Average")

Develop a helper function.

# Function to convert string dates to actual dates
get_first_date <- function(x){
  last_year <- cumsum(x=="12/22 - 12/23")>0
  dates <- str_split(x, " - ")
  dates <- lapply(1:length(dates), function(x) as.Date(paste0(dates[[x]],
                                                              ifelse(last_year[x], "/2015", "/2016")),
                                                       format = "%m/%d/%Y"))
  first_date <- lapply(dates, function(x) x[1]) %>% unlist
  second_date <- lapply(dates, function(x) x[2])%>% unlist
  data_frame(first_date = as.Date(first_date, origin = "1970-01-01"),
             second_date = as.Date(second_date, origin = "1970-01-01"))
}

Continue cleaning.

# Convert dates to dates, impute MoE for missing polls with average of non-missing,
# and convert MoE to standard deviation (assuming MoE is the full 95% one sided interval length??)
polls <- polls %>%
  mutate(start_date = get_first_date(Date)[[1]],
         end_date = get_first_date(Date)[[2]],
         N = as.numeric(gsub("[A-Z]*", "", Sample)),
         MoE = as.numeric(MoE))%>%
  select(end_date, `Clinton (D)`, `Trump (R)`, MoE) %>%
  mutate(MoE = ifelse(is.na(MoE), mean(MoE, na.rm = T), MoE),
         sigma = MoE/2) %>%
  arrange(end_date) %>%
  filter(!is.na(end_date))

# Stretch out to get missing values for days with no polls
polls3 <- left_join(data_frame(end_date = seq(from = min(polls$end_date),
                                              to= as.Date("2016-08-04"),
                                              by = "day")), polls) %>%
  group_by(end_date) %>%
  mutate(N = 1:n()) %>%
  rename(Clinton = `Clinton (D)`,
         Trump = `Trump (R)`)

I wanted to extend the data frame with blank values out until closer to the election. This is that step.

polls4 <- polls3 %>%
  full_join(
    tibble(end_date = seq.Date(min(polls3$end_date),
                               as.Date("2016-11-08"), by = 1)))
# One row for each day, one column for each poll on that day, -9 for missing values
Y_clinton <- polls4 %>% reshape2::dcast(end_date ~ N, value.var = "Clinton") %>%
  dplyr::select(-end_date) %>%
  as.data.frame %>% as.matrix
Y_clinton[is.na(Y_clinton)] <- -9
Y_trump <- polls4 %>% reshape2::dcast(end_date ~ N, value.var = "Trump") %>%
  dplyr::select(-end_date) %>%
  as.data.frame %>% as.matrix
Y_trump[is.na(Y_trump)] <- -9

# Do the same for margin of errors for those polls
sigma <- polls4 %>% reshape2::dcast(end_date ~ N, value.var = "sigma")%>%
  dplyr::select(-end_date)%>%
  as.data.frame %>% as.matrix
sigma[is.na(sigma)] <- -9

Our Model

I have modified the model slightly to add the polling inflator that Peter Ellis uses in order to account for error outside of traditional polling error. There is a great deal of literature about this point in the Total Survey Error framework. Basically adding this inflator allows for additional uncertainty to be put into the model.

writeLines(readLines("model.stan"))
// From James Savage at https://github.com/khakieconomics/stanecon_short_course/blob/80263f84ebe95be3247e591515ea1ead84f26e3f/03-fun_time_series_models.Rmd

//and modification inspired by Peter Ellis at https://github.com/ellisp/ozfedelect/blob/master/model-2pp/model-2pp.R

// saved as models/state_space_polls.stan
data {
  int polls; // number of polls
  int T; // number of days
  matrix[T, polls] Y; // polls
  matrix[T, polls] sigma; // polls standard deviations
  real inflator;         // amount by which to multiply the standard error of polls
  real initial_prior;
  real random_walk_sd;
  real mu_sigma;
}
parameters {
  vector[T] mu; // the mean of the polls
  real<lower = 0> tau; // the standard deviation of the random effects
  matrix[T, polls] shrunken_polls;
}
model {
  // prior on initial difference
  mu[1] ~ normal(initial_prior, mu_sigma);
  tau ~ student_t(4, 0, 5);
  // state model
  for(t in 2:T) {
    mu[t] ~ normal(mu[t-1], random_walk_sd);
  }

  // measurement model
  for(t in 1:T) {
    for(p in 1:polls) {
      if(Y[t, p] != -9) {
        Y[t,p]~ normal(shrunken_polls[t, p], sigma[t,p] * inflator);
        shrunken_polls[t, p] ~ normal(mu[t], tau);
      } else {
        shrunken_polls[t, p] ~ normal(0, 1);
      }
    }
  }
}

Compile The Model

state_space_model <- stan_model("model.stan")

Prep the Data


clinton_data <- list(
  T = nrow(Y_clinton),
  polls = ncol(Y_clinton),
  Y = Y_clinton,
  sigma = sigma,
  initial_prior = 50,
  random_walk_sd = 0.2,
  mu_sigma = 1,
  inflator =sqrt(2)
)

trump_data <- list(
  T = nrow(Y_trump),
  polls = ncol(Y_trump),
  Y = Y_trump,
  sigma = sigma,
  initial_prior = 40,
  random_walk_sd = 0.2,
  mu_sigma = 1,
  inflator =sqrt(2)
)

Run the Model

clinton_model <- sampling(state_space_model,
                      data = clinton_data,
                      iter = 600,
                      refresh = 0, chains = 2,
                      control = list(adapt_delta = .95,
                                     max_treedepth = 15))

trump_model <- sampling(state_space_model,
                    data = trump_data,
                    iter = 600,
                    refresh = 0, chains = 2,
                    control = list(adapt_delta = .95,
                                     max_treedepth = 15))

Inferences

# Pull the state vectors
mu_clinton <- extract(clinton_model, pars = "mu", permuted = T)[[1]] %>%
  as.data.frame
mu_trump <- extract(trump_model, pars = "mu", permuted = T)[[1]] %>%
  as.data.frame
# Rename to get dates
names(mu_clinton) <- unique(paste0(polls4$end_date))
names(mu_trump) <- unique(paste0(polls4$end_date))
# summarise uncertainty for each date
mu_ts_clinton <- mu_clinton %>% reshape2::melt() %>%
  mutate(date = as.Date(variable)) %>%
  group_by(date) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975),
            candidate = "Clinton")

mu_ts_trump <- mu_trump %>% reshape2::melt() %>%
  mutate(date = as.Date(variable)) %>%
  group_by(date) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975),
            candidate = "Trump")

Which gives us the following:


actual_voteshare <- tibble(date = as.Date("2016-11-08"),
                           value = c(48.2, 46.1),
                           candidate = c("Clinton", "Trump"))

bind_rows(mu_ts_clinton, mu_ts_trump) %>%
  ggplot(aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = candidate),alpha = 0.1) +
  geom_line(aes(y = median, colour = candidate)) +
  ylim(20, 70) +
  scale_colour_manual(values = c("blue", "red"), "Candidate") +
  scale_fill_manual(values = c("blue", "red"), guide = F) +
  geom_point(data = polls4, aes(x = end_date, y = `Clinton`), size = 0.2, colour = "blue") +
  geom_point(data = polls4, aes(x = end_date, y = Trump), size = 0.2, colour = "red") +
  xlab("Date") +
  ylab("Implied vote share") +
  ggtitle("Poll aggregation with state-space smoothing",
          subtitle= paste("Prior of 50% initial for Clinton, 40% for Trump on", min(polls4$end_date)))+
  theme_minimal()+
  geom_point(data = filter(actual_voteshare, candidate == "Clinton"),
             aes(date, value), colour = "blue", size = 2)+
  geom_point(data = filter(actual_voteshare, candidate == "Trump"),
             aes(date, value), colour = "red", size = 2)

So the outcome was not totally out of the bounds of a good predictive model!

Read more…

Re-districting in Winston-Salem

Introduction

There was a recent bill introduced in the North Carolina General Assembly by two of its Republican members to reorganise the city council. The status quo is that there are a total of eight wards, each with a seat on the city council. The mayor votes only when there is a tie. As things are, the current city council is composed of four Black individuals and four white men, with a white man as mayor. The proposal from the General Assembly is to collapse several of the ward seats into five wards and then create three permanent at-large city council positions. Additionally, the mayor would have a vote on all matters, not just in ties.

Read more…

Omitted Variable Bias

<script src="2019-04-07-omitted-variable-bias_files/libs/htmlwidgets-1.6.2/htmlwidgets.js"></script> <script src="2019-04-07-omitted-variable-bias_files/libs/viz-1.8.2/viz.js"></script> <script src="2019-04-07-omitted-variable-bias_files/libs/grViz-binding-1.0.10/grViz.js"></script>

One important concept to discuss is that of omitted variable bias. This occurs when you have endogenous predictors that you do not adequately control for in your analysis.
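A small simulation sketch of the problem (coefficients are made up): a confounder z drives both x and y, so leaving z out biases the estimate on x away from its true value of 1.

```r
set.seed(2019)
n <- 5000
z <- rnorm(n)                    # the omitted variable
x <- 0.8 * z + rnorm(n)          # x is correlated with z
y <- 1 * x + 2 * z + rnorm(n)    # y depends on both

coef(lm(y ~ x))      # biased, because x is standing in for the omitted z
coef(lm(y ~ x + z))  # recovers roughly 1 once z is controlled for
```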

Read more…

MRP Redux

Background

I recently got a question about using MRP and I thought it would be worthwhile to share some additional explanation of the approach with a simulated data set. Simulating your data and testing your method is a really good way to understand if your model is sensitive enough to detect differences. This kind of fake data simulation will allow you to see if your model fails before using it in production or in the field, where the cost of failure is high.

Read more…

Speeding Things Up with Rcpp

Introduction

I worked on something that started in R, and then I wanted to speed it up. MCMC is generally a slow process in base R because it can’t be parallelised easily, as each state depends on the previous state. Rcpp is a wonderful avenue to speed things up.
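A minimal sketch of the pattern (a toy random-walk Metropolis sampler, not the actual project code): move the tight loop into C++ with Rcpp::cppFunction().

```r
library(Rcpp)

cppFunction('
NumericVector rw_metropolis(int n, double proposal_sd) {
  NumericVector x(n);
  x[0] = 0.0;
  for (int i = 1; i < n; ++i) {
    double proposal = x[i - 1] + R::rnorm(0.0, proposal_sd);
    // accept/reject against a standard normal target (log scale)
    double log_ratio = R::dnorm(proposal, 0.0, 1.0, 1) -
                       R::dnorm(x[i - 1], 0.0, 1.0, 1);
    x[i] = (log(R::runif(0.0, 1.0)) < log_ratio) ? proposal : x[i - 1];
  }
  return x;
}')

draws <- rw_metropolis(1e5, 0.5)
mean(draws); sd(draws)
```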

Read more…

Latex in ggplot2

ggplot2 latex trick

This is a useful package for using LaTeX notation in {ggplot2}. I saw this on Twitter and wish I had written down the originator for proper citation, but I forgot at the time.

Read more…

MRP using brms

Multi-Level Regression with Post-Stratification

This blogpost is a reproduction of this. Multi-level regression with post-stratification (MRP) is one of the more powerful predictive strategies for opinion polling and election forecasting. It combines multi-level modeling, which anticipates and predicts group- and population-level effects and has good small-N predictive properties, with post-stratification, which helps to temper the predictions to the demographics or other variables of interest. This method has been put forward by Andrew Gelman et al. and there is a good deal of literature on the subject.
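As a compact sketch of the two steps (simulated survey data and made-up census counts, just to show the shape of the workflow): fit a multilevel model, then weight the cell-level predictions by the population counts.

```r
library(brms)
library(dplyr)
library(tidyr)

set.seed(1)
states <- paste0("state_", 1:5)
ages   <- c("18-29", "30-44", "45-64", "65+")

# simulated survey responses
survey_df <- expand_grid(state = states, age_group = ages) %>%
  slice(rep(1:n(), each = 30)) %>%
  mutate(opinion = rbinom(n(), 1,
                          plogis(-0.5 + 0.3 * as.integer(factor(age_group)))))

# step 1: multilevel regression
fit <- brm(opinion ~ (1 | state) + (1 | age_group),
           family = bernoulli(), data = survey_df,
           chains = 2, iter = 1000, refresh = 0)

# step 2: post-stratification against (made-up) census cell counts
poststrat_df <- expand_grid(state = states, age_group = ages) %>%
  mutate(N = sample(1000:5000, n(), replace = TRUE))

cell_probs <- posterior_epred(fit, newdata = poststrat_df)  # draws x cells

poststrat_df %>%
  mutate(p_hat = colMeans(cell_probs)) %>%
  group_by(state) %>%
  summarise(mrp_estimate = sum(p_hat * N) / sum(N))
```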

Read more…

Replicating gsynth

Introduction

Synthetic control methods are a family of causal inference approaches that seek to combine traditional difference-in-differences studies on time series cross-sectional data with factor analysis for uncontrolled/unobserved measures. This line of work grew out of Abadie’s initial contributions (Abadie, Diamond, and Hainmueller 2011a) and is growing in importance for state-level policy analysis. It is interesting in that it combines some elements of factor analysis to develop predictors and regression analysis to try to capture explained and unexplained variance.
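As a hedged sketch of what the call looks like (the data frame and variable names below are placeholders for a balanced panel, not a real dataset):

```r
library(gsynth)

# panel_df: hypothetical panel with outcome Y, treatment indicator D,
# covariates X1/X2, and unit/time identifiers
out <- gsynth(Y ~ D + X1 + X2, data = panel_df,
              index = c("id", "time"), force = "two-way",
              CV = TRUE, se = TRUE, nboots = 200)

plot(out, type = "gap")  # treated-vs-counterfactual gap over time
```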

Read more…

the power of fake data simulations

This post is basically a self-exercise of what Andrew Gelman has already posted here. Fake data simulations are incredible tools for understanding your study. They force you to think about the size of the effect you wish to see, what kind of variance is in your model, whether you can really detect it, whether your design will work, and the list goes on. Similar to any practice, when you have to think critically and put things to paper you tend to see the weaknesses of your arguments. It also helps you to anticipate issues. All of these things are priceless.
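A minimal sketch of the workflow (all numbers made up): pick a “true” effect, simulate data at the planned sample size, fit the intended model, and see how often the effect is detected.

```r
set.seed(123)

simulate_once <- function(n = 100, true_effect = 0.3) {
  treatment <- rbinom(n, 1, 0.5)
  outcome   <- 0.5 + true_effect * treatment + rnorm(n, sd = 1)
  fit <- lm(outcome ~ treatment)
  summary(fit)$coefficients["treatment", "Pr(>|t|)"] < 0.05
}

# rough power at n = 100 for an effect of 0.3 sd
mean(replicate(1000, simulate_once()))
```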

Read more…

a foray into network analysis

Marriage Networks of Florence

This post takes the example provided in Kosuke Imai’s Quantitative Social Science: An Introduction. It provides some exploration of network analysis by looking at the florentine dataset, which records the marriage relationships between several Florentine families in the 15th century. My historical side knows that the analysis should show the Medici family at the center of all of the political intrigue. Let’s see if the analysis plays out.
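A small sketch of the kind of check involved (using a made-up subset of ties rather than the full florentine data): build the graph and compare centrality measures.

```r
library(igraph)

edges <- data.frame(
  from = c("Medici", "Medici", "Medici", "Strozzi", "Strozzi", "Albizzi"),
  to   = c("Albizzi", "Ridolfi", "Tornabuoni", "Ridolfi", "Peruzzi", "Guadagni")
)
g <- graph_from_data_frame(edges, directed = FALSE)

sort(degree(g), decreasing = TRUE)       # who has the most marriage ties
sort(betweenness(g), decreasing = TRUE)  # who sits on the most shortest paths
```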

Read more…

models of microeconomics

I just wanted to explore a little more some of the topics covered in the fantastic Applied Econometrics with R. All of these examples come from their text in Chapter 3.

Read more…

Analysis of Short Time Series

I enjoy reading Rob Hyndman’s blog. The other day he did some analysis of a short time series; more about that is available on his blog here. The neat thing he shows is that you don’t need a tremendous amount of data to decompose seasonality, using Fourier terms.
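A hedged sketch of the general idea (my own toy series, not Hyndman’s code): regress on Fourier terms for the seasonal pattern with ARMA errors via the forecast package.

```r
library(forecast)

# three years of monthly data with a sinusoidal seasonal pattern
set.seed(1)
y <- ts(5 * sin(2 * pi * (1:36) / 12) + rnorm(36), frequency = 12)

fit <- auto.arima(y, xreg = fourier(y, K = 2), seasonal = FALSE)
fc  <- forecast(fit, xreg = fourier(y, K = 2, h = 12))
plot(fc)
```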

Read more…

make your own api

<script src="2018-07-12-make-your-own-api_files/libs/htmlwidgets-1.6.2/htmlwidgets.js"></script> <script src="2018-07-12-make-your-own-api_files/libs/d3-3.3.8/d3.min.js"></script> <script src="2018-07-12-make-your-own-api_files/libs/dagre-0.4.0/dagre-d3.min.js"></script> <script src="2018-07-12-make-your-own-api_files/libs/mermaid-0.3.0/dist/mermaid.slim.min.js"></script> <script src="2018-07-12-make-your-own-api_files/libs/chromatography-0.1/chromatography.js"></script> <script src="2018-07-12-make-your-own-api_files/libs/DiagrammeR-binding-1.0.10/DiagrammeR.js"></script>

Read more…

IRT and the Rasch Model

I am inspired by the blog post from the team at Stitch Fix (see details here) regarding using item response theory for “latent size.” It is a neat approach. I need to say upfront that I am not a trained psychometrician, psychologist, etc. I understand the statistics, but there are nuances of which I know I am ignorant. However, I know that these methods are certainly worthwhile and worth pursuing.

Read more…

Exploring forecast

Time series forecasting is interesting. I want to write more, but not right now. I’m just going to lay out a few functions for explanation later.

Read more…

Speed it up!

This will be a short post tonight. The topic is speeding up existing code. Typically the rule is make it work (e.g. do what you want it to do) before optimizing it. The trick then is that with some experience you can write your code so that you don’t box yourself into slow methods. That I won’t cover in this post.

Read more…

ggrough

<script src="2018-07-05-ggrough-a-good-way-make-graphics-look-user-generated_files/libs/htmlwidgets-1.6.2/htmlwidgets.js"></script> <script src="2018-07-05-ggrough-a-good-way-make-graphics-look-user-generated_files/libs/roughjs-2.0.1/rough.min.js"></script> <script src="2018-07-05-ggrough-a-good-way-make-graphics-look-user-generated_files/libs/ggrough-binding-0.1.0/ggrough.js"></script>

Read more…

Bayesian Time Series Analysis with bsts

Bayesian methods are where we need to go. I have pretty strong opinions on this, as Bayes provides a way to incorporate our prior understanding of the world, move away from null hypothesis testing, and take advantage of prior work. Additionally, hierarchical modeling provides a way to pool strength across groups when you have a smaller number of data points. All in all it is a great way to do analysis.
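A minimal sketch of the bsts workflow (a simulated series and arbitrary settings, just to show the shape of a fit): build a state specification, run the MCMC, then predict.

```r
library(bsts)

set.seed(1)
y <- cumsum(rnorm(100, 0.1, 1))  # a toy trending series

ss  <- AddLocalLinearTrend(list(), y)
fit <- bsts(y, state.specification = ss, niter = 500)

pred <- predict(fit, horizon = 12)
plot(pred)
```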

Read more…

gghighlight for the win

gghighlight is a package on CRAN that allows one to highlight certain features one finds valuable. Right now I typically do this with some custom color coding, then pass that into the ggplot2 arguments. This might serve as a good way to more easily automate that task. Additionally, it could be super handy during exploratory analysis, where finding patterns is much more iterative.
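A quick sketch of the usage (simulated series and an arbitrary threshold): highlight only the series that meet a predicate and let the rest fall to the background.

```r
library(ggplot2)
library(gghighlight)

set.seed(7)
d <- data.frame(x = rep(1:20, times = 10),
                g = rep(letters[1:10], each = 20))
d$y <- ave(rnorm(200, mean = 1), d$g, FUN = cumsum)

ggplot(d, aes(x, y, colour = g)) +
  geom_line() +
  gghighlight(max(y) > 20)
```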

Read more…

Let’s Try Some Visualisation

So I am going to try to use this blog to capture programming tricks, examples of R coding, data visualisation, and general musings that I find interesting. This is mainly for personal use and as a nice memory tool.

Read more…