We assess the prediction performance of objective physical activity measures and their ranking relative to other established predictors of 5-year all-cause mortality in the US. In this analysis, we use participants between 50 and 85 years old from the 2003-2004 and 2005-2006 samples of the National Health and Nutrition Examination Survey (NHANES, n = 2978, number of deaths within 5 years = 297) who wore a hip-worn accelerometer in the free-living environment for up to 7 days.

Note: the HTML file displays only the partial R code necessary to illustrate the functions and arguments used to execute the analysis. Full code is available in the corresponding .Rmd file.

Prerequisites

The following packages are used in this vignette to illustrate the NHANES 5-year mortality prediction model. Note that we need to use R's old default random number generator to reproduce the published results.

library("knitr")
library("kableExtra")
library("devtools")
library("ggplot2")
library("gridExtra")
library("corrplot")
library("reshape2")
library("magrittr")
library("plyr")
library("dplyr")
library("survey")
library("mgcv")
library("refund")
library("rnhanesdata")
RNGkind(sample.kind="Rounding")

NHANES 2003-2004 and 2005-2006 5-year mortality model data selection

We start by downloading the NHANES 2003-2004 and 2005-2006 cohorts' information, processing the data, and combining survey weights for the two cohorts using the R package rnhanesdata (Leroux, 2018). Items 1-11 illustrate the processing steps we performed to obtain the subset of NHANES data that meets the analysis criteria and to derive activity summaries. These steps produce an R object called "data_analysis", which is used in the 5-year mortality prediction model. **For simplicity of presentation, we suppress the code printout at each step; this can be changed by setting echo = TRUE in the R chunk options (e.g., {r, echo = TRUE}). Full code is available in the vignette_prediction.Rmd file.**

  1. Download and process NHANES lab measurements from the 2 cohorts: systolic blood pressure readings, total cholesterol, and HDL cholesterol. These data are saved in a temporary directory, which is then removed once the covariate information is processed. Blood pressure is recorded up to 4 times for each participant. The data collection procedure and descriptions of the cholesterol and blood pressure variables are available at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2003 for the 2003-2004 wave and at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2005 for the 2005-2006 wave.
# Create a (local) temporary directory 
# where lab measurement (cholesterol, blood pressure) data will be downloaded from the CDC website 
# and then loaded into R. These files need to be downloaded separately as 
# the raw files associated with these lab measurements are not included in the rnhanesdata package.
dir_tmp = tempfile()

if (!dir.exists(dir_tmp)){
  dir.create(dir_tmp, showWarnings = FALSE)
}
dl_file = function(url) {
  bn = basename(url) 
  destfile = file.path(dir_tmp, bn)
  if (!file.exists(destfile)) {
    out = download.file(url, destfile = destfile, mode="wb")
  }
  stopifnot(file.exists(destfile))
}
## download the lab measurement data for the cohort 2003-2004
# Cholesterol - Total & HDL: LBXTC and LBXHDD
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/L13_C.XPT")
# Blood Pressure: BPXSY1 , BPXSY2, BPXSY3 and BPXSY4
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/BPX_C.XPT")

## download the lab measurement data for the cohort 2005-2006
# Total Cholesterol: LBXTC
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/TCHOL_D.XPT")
# HDL Cholesterol: LBDHDD
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/HDL_D.XPT")
# Blood Pressure, up to 4 measurements per person: BPXSY1 , BPXSY2, BPXSY3 and BPXSY4
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/BPX_D.XPT")


varnames <- c("LBXTC","LBXHDD","LBDHDD",           ## 1. cholesterol. Note LBXHDD and LBDHDD are the same variable, 
                                                   ##    but different names for 2003-2004 and 2005-2006 cohorts
              "BPXSY1","BPXSY2","BPXSY3", "BPXSY4" ## 2. blood pressure measurements
              )


## load and merge the lab data
lab_data <- process_covar(varnames = varnames,localpath=dir_tmp)

## change column name for cholesterol variable that changed names
colnames(lab_data$Covariate_C)[colnames(lab_data$Covariate_C) == "LBXHDD"] <- "LBDHDD"

## combine waves
CVMarkers <- bind_rows(lab_data$Covariate_C, lab_data$Covariate_D)

rm(list=c("lab_data","dir_tmp","varnames"))
  2. Load minute-level activity data and combine it with the lab measurements, survey sampling data, and mortality data included in the rnhanesdata package.
## load the data
data("PAXINTEN_C");data("PAXINTEN_D")
data("Flags_C");data("Flags_D")
data("Mortality_2015_C");data("Mortality_2015_D")
data("Covariate_C");data("Covariate_D")

## re-code activity counts which are considered "non-wear" to be 0
## this doesn't impact many data points, most estimated non-wear times correspond to 0 counts
col_vars = paste0("MIN",1:1440)
PAXINTEN_C[, col_vars] <- PAXINTEN_C[, col_vars]*Flags_C[, col_vars]
PAXINTEN_D[, col_vars] <- PAXINTEN_D[, col_vars]*Flags_D[, col_vars]

## Merge covariate, mortality, and accelerometry data
## note that both PAXINTEN_* and Covariate_* have a column
## called "SDDSRVYR" indicating which NHANES wave the data is associated with.
## To avoid duplicating this column in the merged data, we add this variable to the "by"
## argument in left_join()
AllAct_C <- left_join(PAXINTEN_C, Mortality_2015_C, by = "SEQN") %>%
        left_join(Covariate_C, by=c("SEQN", "SDDSRVYR"))
AllAct_D <- left_join(PAXINTEN_D, Mortality_2015_D, by = "SEQN") %>%
        left_join(Covariate_D, by=c("SEQN", "SDDSRVYR"))

AllFlags_C <- left_join(Flags_C, Mortality_2015_C, by = "SEQN") %>%
        left_join(Covariate_C, by=c("SEQN", "SDDSRVYR"))
AllFlags_D <- left_join(Flags_D, Mortality_2015_D, by = "SEQN") %>%
        left_join(Covariate_D, by=c("SEQN", "SDDSRVYR"))

## clean up the workspace for memory purposes
rm(list=c(paste0(c("PAXINTEN_", "Covariate_","Mortality_2015_","Flags_"),rep(LETTERS[3:4],each=4))))
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   2097600  112.1    4267247  227.9         NA   3290721  175.8
## Vcells 155701530 1188.0  400283835 3054.0      16384 342513909 2613.2
## combine data for the two waves
AllAct   <- bind_rows(AllAct_C,AllAct_D)
rm(list = c("AllAct_C","AllAct_D"))
AllFlags <- bind_rows(AllFlags_C,AllFlags_D)
rm(list = c("AllFlags_C","AllFlags_D"))

#merge with cardiovascular markers 
AllAct <- left_join(AllAct, CVMarkers, by = "SEQN")
AllFlags <- left_join(AllFlags, CVMarkers, by = "SEQN")

## clean up the workspace again
rm(list = "CVMarkers");
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   2095065  111.9    4267247  227.9         NA   4267247  227.9
## Vcells 156798177 1196.3  400283835 3054.0      16384 400258277 3053.8
  3. Create new factor covariates from the NHANES questionnaire data, which will be used in the prediction model. In addition, calculate average systolic blood pressure using up to 4 measurements per participant.
## Code 5-year mortality; NA for individuals who are alive with less than 5 years of follow-up
AllAct$yr5_mort <- AllFlags$yr5_mort <- as.integer(
  ifelse(AllAct$permth_exm/12 <= 5 & AllAct$mortstat == 1, 1,
         ifelse(AllAct$permth_exm/12 < 5 & AllAct$mortstat == 0, NA, 0))
                                                    )
## Create Age in years using the age at examination (i.e. when participants wore the device)
AllAct$Age <- AllFlags$Age <- AllAct$RIDAGEEX/12


## Re-level comorbidities to assign refused/don't know as not having the condition
## Note that in practice this does not affect many individuals, but it is an assumption we're making.
levels(AllAct$CHD)    <- levels(AllFlags$CHD)    <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$CHF)    <- levels(AllFlags$CHF)    <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Stroke) <- levels(AllFlags$Stroke) <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Cancer) <- levels(AllFlags$Cancer) <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Diabetes) <- levels(AllFlags$Diabetes) <- list("No" = c("No","Borderline", "Refused","Don't know"), "Yes" = c("Yes"))


## Re-level education to have 3 levels and categorize don't know/refused to be missing
levels(AllAct$EducationAdult) <- levels(AllFlags$EducationAdult) <- list("Less than high school" = c("Less than 9th grade", "9-11th grade"),
                                                                         "High school" = c("High school grad/GED or equivalent"),
                                                                         "More than high school" = c("Some College or AA degree", "College graduate or above"))

## Re-level alcohol consumption to include a level for "missing"
levels(AllAct$DrinkStatus) <- levels(AllFlags$DrinkStatus) <- c(levels(AllAct$DrinkStatus), "Missing alcohol")
AllAct$DrinkStatus[is.na(AllAct$DrinkStatus)] <- AllFlags$DrinkStatus[is.na(AllAct$DrinkStatus)] <- "Missing alcohol"


# systolic blood pressure calculation 
AllAct$SYS <- AllFlags$SYS <-  round( rowMeans(AllAct[,c("BPXSY1","BPXSY2","BPXSY3", "BPXSY4")], na.rm = TRUE))

## Re-order columns so that activity and wear/non-wear flags are the last 1440 columns of our two
## data matrices. This is a personal preference and is absolutely not necessary.
act_cols <- which(colnames(AllAct) %in% paste0("MIN",1:1440))
oth_cols <- which(!colnames(AllAct) %in% paste0("MIN",1:1440))
AllAct   <- AllAct[,c(oth_cols,act_cols)]
AllFlags <- AllFlags[,c(oth_cols,act_cols)]
rm(list=c("act_cols","oth_cols"))
  4. Calculate daily activity summary measures: total activity count (TAC), total log activity count (TLAC), total accelerometer wear time (WT), total minutes of moderate/vigorous physical activity (MVPA), sedentary/sleep/non-wear to active transition probability (SATP\(_{sl/nw}\)), and active to sedentary/sleep/non-wear transition probability (ASTP\(_{sl/nw}\)). In addition, compute total log activity count summary measures (TLAC_1, TLAC_2, ..., TLAC_12) in each 2-hr window, i.e. 12AM-2AM, 2AM-4AM, 4AM-6AM, etc. Note, there is one individual with 501 minutes recorded as NA. These missing values occur during the last 501 minutes of the final day this participant wore the device. We impute these missing data with 0.
## Assign just the activity and wear/non-wear flag data to matrices.
## This makes computing the features faster and is needed for the row-wise bout calculations below.
act_mat  <- AllAct[,paste0("MIN",1:1440)]
act_mat = as.matrix(act_mat)

## replace NAs with 0s
for (icol in seq(ncol(act_mat))) {
  act_mat[, icol][is.na(act_mat[, icol])]   <- 0
}

AllAct$TAC   <- AllFlags$TAC   <- rowSums(act_mat)
AllAct$TLAC  <- AllFlags$TLAC  <- rowSums(log(1 + act_mat))
AllAct$ST    <- AllFlags$ST    <- rowSums(act_mat < 100)
AllAct$MVPA  <- AllFlags$MVPA  <- rowSums(act_mat >= 2020)

# compute total log activity count in each 2-hr window, 
# 2 hour (120 minutes) binning window 
tlen <- 120
nt   <- floor(1440/tlen)
# create a list of indices for binning into 2-hour windows
inx_col_ls <- split(1:1440, rep(1:nt,each=tlen))
Act_2hr    <- sapply(inx_col_ls, function(x) rowSums(log(1+act_mat[,x,drop=FALSE])))
colnames(Act_2hr) <- paste0("TLAC_",c(1:12))
rm(list=c("tlen","nt","inx_col_ls"))
act_mat = act_mat >= 100
## calculate fragmentation measures
get_sed_act = function(x){
                        mat <- rle(x)
                        rm(x)
                        sed <- mat$lengths[!mat$values]
                        act <- mat$lengths[mat$values]

                        sed <- ifelse(length(sed) == 0, NA, mean(sed))
                        act <- ifelse(length(act) == 0, NA, mean(act))
                        c(sed, act)
}
bout_mat = matrix(nrow = 2, ncol = nrow(act_mat))
for (i in seq(nrow(act_mat))) {
  bout_mat[, i] = get_sed_act(act_mat[i, ])
}
rm(act_mat)
AllAct$SBout <- AllFlags$SBout <- bout_mat[1,]
AllAct$ABout <- AllFlags$ABout <- bout_mat[2,]
AllAct$SATP  <- AllFlags$SATP  <- 1/AllAct$SBout
AllAct$ASTP  <- AllFlags$ASTP  <- 1/AllAct$ABout

rm(bout_mat)
AllAct$WT    <- AllFlags$WT    <- rowSums(AllFlags[,paste0("MIN",1:1440)], na.rm = TRUE)
AllAct   <- cbind(AllAct, Act_2hr)
AllFlags <- cbind(AllFlags, Act_2hr)

rm(list = c("Act_2hr"))
  5. Exclude participants who were: 1) younger than 50 years old, or 85 and older, at the time they wore the accelerometer (10,859 participants); 2) missing BMI or education predictor variables (41 participants); 3) had fewer than 3 days of data with at least 10 hours of estimated wear time (517 participants); 4) missing mortality information (16 participants); 5) alive with follow-up of less than 5 years (17 participants); 6) missing systolic blood pressure, total cholesterol, or HDL cholesterol measurements (293 participants). Table 1 below displays the number of participants with missing data for all pairwise combinations of variables with missing data.
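Although the exclusion code is suppressed in the rendered output, a minimal sketch of these steps is given below. It assumes the day-level AllAct data frame created above and uses only the wear-time definition of a "good" day stated in the text; the exact (sequential) exclusion code used for the published counts is in vignette_prediction.Rmd.

## Sketch of the exclusion criteria (assumes AllAct from the steps above).
## The published analysis applies these sequentially; here they are applied jointly.
keep <- AllAct$Age >= 50 & AllAct$Age < 85                             ## 1) age range
keep <- keep & !is.na(AllAct$BMI_cat) & !is.na(AllAct$EducationAdult)  ## 2) BMI and education
## 3) at least 3 days with >= 10 hours (600 minutes) of estimated wear time
AllAct$goodday <- as.integer(AllAct$WT >= 600)
keep <- keep & ave(AllAct$goodday, AllAct$SEQN, FUN = sum) >= 3
## 4) + 5) non-missing mortality information and not alive with < 5 years of follow-up
keep <- keep & !is.na(AllAct$yr5_mort)
## 6) non-missing systolic blood pressure, total cholesterol, and HDL cholesterol
keep <- keep & !is.na(AllAct$SYS) & !is.na(AllAct$LBXTC) & !is.na(AllAct$LBDHDD)

## day-level data for the analysis sample, "good" days only
Act_Analysis <- AllAct[which(keep & AllAct$goodday == 1), ]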
Table 1: Missing data patterns

|                | BMI | Education | Bad Accel Data | Mortality | Follow-up | Lab |
|----------------|-----|-----------|----------------|-----------|-----------|-----|
| BMI            | 35  | 0         | 8              | 1         | 0         | 7   |
| Education      | 0   | 6         | 2              | 0         | 0         | 1   |
| Bad Accel Data | 8   | 2         | 517            | 4         | 0         | 57  |
| Mortality      | 1   | 0         | 4              | 21        | 0         | 1   |
| Follow-up      | 0   | 0         | 0              | 0         | 0         | 0   |
| Lab            | 7   | 1         | 57             | 1         | 0         | 293 |
  6. Create our dataset for analysis, "data_analysis", with one row per subject, containing only those subjects who meet our inclusion criteria.

Thus, after the exclusion criteria were applied, the data set contained 2978 subjects. Among these participants, 86 had missing alcohol consumption information; for these participants we introduced the category "Missing alcohol" and retained them in the dataset.

Maximum follow-up time in this cohort in years

## [1] 13.08333
  7. Calculate subject-specific averages of the accelerometry features using only the "good" days of data.
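The code for this step is suppressed; a brief sketch, assuming the day-level Act_Analysis object sketched above (good days only), is:

## average each daily summary measure across a subject's good days
act_vars <- c("TAC", "TLAC", "WT", "ST", "MVPA", "SATP", "ASTP", paste0("TLAC_", 1:12))
subj_act <- Act_Analysis %>%
        group_by(SEQN) %>%
        summarize_at(act_vars, mean, na.rm = TRUE) %>%
        ungroup()
## subj_act (one row per subject) is then merged with the covariate data to build data_analysis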

  8. Get the log(count + 1) activity data in matrix format and conduct functional principal component analysis (fPCA) using the fpca.face() function (with 50 knots) from the refund package in R.

Act <- log(1 + Act_Analysis[,paste0("MIN",1:1440)])
Act[is.na(Act)] <- 0
Act = as.matrix(Act)
fpca_fit <- fpca.face(Act,knots=50)
  9. Retain the first 6 PCs, which explain approximately 57% of the variability in the daily activity data (the space spanned by the rows of the matrix of non-smoothed, log-transformed activity counts), obtain the score for each day on each PC, and calculate the mean and standard deviation of these scores for each subject across days.
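The code for this step is suppressed. A sketch of how the explained variability and the per-subject score summaries could be obtained from the fpca.face() fit is shown below; note that the eigenvalue-based proportion of variance refers to the smoothed representation and may differ slightly from the 57% quoted above, which is computed relative to the non-smoothed log counts. The column names PC1_m, ..., PC6_s are illustrative and correspond to \(m_{i1}, \dots, s_{i6}\) in the text.

## proportion of variability explained by the leading PCs (of the smoothed fit)
round(cumsum(fpca_fit$evalues) / sum(fpca_fit$evalues), 3)

## per-day scores on the first 6 PCs; rows of Act correspond to rows of Act_Analysis
scores_6 <- as.data.frame(fpca_fit$scores[, 1:6])
colnames(scores_6) <- paste0("PC", 1:6)
scores_6$SEQN <- Act_Analysis$SEQN

## subject-level mean (m) and standard deviation (s) of the scores across days
score_summ <- scores_6 %>%
        group_by(SEQN) %>%
        summarize_at(paste0("PC", 1:6), list(m = mean, s = sd)) %>%
        ungroup()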

  10. Identify the PC-specific means and standard deviations that are significantly associated with 5-year mortality using a backward-selection logistic regression model fitted with the svyglm() function in R. In this step, the demographic, behavioral, and comorbidity variables were included in all models, and the selection was conducted on the means and standard deviations of the scores on the 6 PCs (a total of 12 variables) using the complex-survey AIC criterion (Lumley and Scott, 2015).

Note, there are 86 participants with missing alcohol information, which was coded as an additional level of the DrinkStatus variable.
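The backward-selection code is suppressed in the rendered output. A sketch is given below; it assumes the per-subject score summaries (PC1_m, ..., PC6_s) from the sketch above have been merged into data_analysis, and that a survey design object analogous to data_analysis_svy_adj (constructed in the "Mortality prediction model" section below) is available. The always-retained covariates mirror those shown in Table 2. If AIC() is not available for quasibinomial svyglm fits in your version of the survey package, this selection step can instead be run with family = binomial().

## candidate variables for removal: means and SDs of the scores on the first 6 PCs
pc_vars   <- c(paste0("PC", 1:6, "_m"), paste0("PC", 1:6, "_s"))
## covariates kept in every model (cf. Table 2)
base_vars <- c("Age", "SmokeCigs", "DrinkStatus", "BMI_cat", "Diabetes", "CHF", "CHD",
               "Stroke", "Cancer", "MobilityProblem", "LBXTC", "LBDHDD", "SYS")

current <- pc_vars
repeat {
  fit_full <- svyglm(as.formula(paste("yr5_mort ~", paste(c(base_vars, current), collapse = "+"))),
                     design = data_analysis_svy_adj, family = quasibinomial())
  aic_full <- AIC(fit_full)[["AIC"]]   ## complex-survey AIC (Lumley and Scott, 2015)

  ## complex-survey AIC after dropping each candidate in turn
  aic_drop <- sapply(current, function(v) {
    fit_v <- svyglm(as.formula(paste("yr5_mort ~",
                                     paste(c(base_vars, setdiff(current, v)), collapse = "+"))),
                    design = data_analysis_svy_adj, family = quasibinomial())
    AIC(fit_v)[["AIC"]]
  })

  if (min(aic_drop) >= aic_full) break                    ## no single drop improves AIC: stop
  current <- setdiff(current, names(which.min(aic_drop)))
}
current   ## PC-score summaries retained (mi1 and si6 in the published analysis)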

Table 2: Final backward selection model estimated coefficients

| Variable | Estimate | p-value | OR Estimate | OR Lower CL | OR Upper CL |
|---|---|---|---|---|---|
| Intercept | -7.590 | 0.000 | 0.001 | 0.000 | 0.004 |
| Age | 0.080 | 0.000 | 1.083 | 1.059 | 1.108 |
| Former Smoker | 0.487 | 0.021 | 1.627 | 1.073 | 2.485 |
| Current Smoker | 0.941 | 0.000 | 2.563 | 1.797 | 3.648 |
| Non-Drinker | 0.567 | 0.000 | 1.763 | 1.279 | 2.440 |
| Heavy Drinker | 1.053 | 0.002 | 2.866 | 1.423 | 5.576 |
| Missing Alcohol | 0.816 | 0.065 | 2.261 | 0.875 | 5.197 |
| Underweight | 0.597 | 0.384 | 1.817 | 0.400 | 6.520 |
| Overweight | -0.554 | 0.005 | 0.575 | 0.386 | 0.854 |
| Obese | -0.509 | 0.000 | 0.601 | 0.478 | 0.755 |
| Diabetes: yes | 0.286 | 0.161 | 1.331 | 0.879 | 1.989 |
| CHF: yes | 0.833 | 0.001 | 2.300 | 1.356 | 3.841 |
| CHD: yes | 0.113 | 0.671 | 1.120 | 0.649 | 1.879 |
| Stroke: yes | 0.287 | 0.299 | 1.332 | 0.756 | 2.279 |
| Cancer: yes | 0.273 | 0.100 | 1.314 | 0.939 | 1.823 |
| Mobility Problem | 0.620 | 0.001 | 1.859 | 1.270 | 2.719 |
| Total Cholesterol | -0.003 | 0.243 | 0.997 | 0.992 | 1.002 |
| HDL Cholesterol | -0.001 | 0.840 | 0.999 | 0.989 | 1.008 |
| Systolic Blood Pressure | 0.000 | 0.949 | 1.000 | 0.993 | 1.008 |
| mi1 | 0.014 | 0.008 | 1.014 | 1.004 | 1.026 |
| si6 | -0.077 | 0.000 | 0.926 | 0.888 | 0.965 |
  11. Replace the significant PC-based measures with corresponding surrogate variables. This step is done to improve the interpretability and comparability (across studies) of the results. For those unfamiliar with principal component analysis, reading the section "Intuition behind fPCA" below may be helpful in understanding this step further.

Identifying potential surrogate measures is based on (subjective) interpretations of the shapes of the principal components. Fundamentally, the idea is to use visual inspection of the principal components to identify the "dominant" features of each component. Then, we return to the original data and calculate a statistic which we believe captures this dominant feature. For example, looking at the upper-left panel of the principal component plot above, we see that days which load negatively on the first principal component tend to be extremely active, while those that load positively tend to be very inactive. As a result, one reasonable "guess" at a surrogate measure which is highly associated with the first component score is simply the total log-transformed activity count (TLAC) for that day. If that is true, we would also expect the average PC1 score within subjects to be highly correlated with their average TLAC across days. In our data, average TLAC and average PC1 score are highly (negatively) correlated (\(\hat{\rho}\) = -0.87), which is expected based on the sign of the first PC.

This procedure would then be repeated for each feature identified as potentially predictive. In our application, we are also interested in the standard deviation of PC6. Looking at the bottom-right panel of the plot above, we see that there are 6 periods where the contrast between days with positive and negative loadings is highest (i.e. where the difference between the \(+\) and \(-\) curves is largest). One reasonable guess for a statistic which is highly correlated with the PC6 score is the difference in average activity during the specific time periods where the positive/negative loadings are high/low, respectively. For example, days that load highly on PC6 should, on average, have higher activity during the mid morning (8AM-10AM), late afternoon (3PM-5PM), and late evening (10PM-12AM) and lower activity during the early morning (5AM-7AM), late morning/early afternoon (11AM-1PM), and early evening (6PM-8PM). Since we are interested in the standard deviation of the PC6 score, we calculate the standard deviation of the difference in average log-transformed activity counts during these periods as a surrogate measure for \(s_{i6}\). In our analysis, we used all 6 of these time periods and obtained an observed correlation of \(\hat{\rho}\) = 0.87, though multiple choices can be explored to see which statistic has the highest correlation with \(s_{i6}\).
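A sketch of how such a surrogate could be constructed and checked is shown below; it assumes the minute-level matrix Act and the Act_Analysis and score_summ objects from the sketches above, and the window choices simply follow the description in the previous paragraph.

## minutes covering the interval [h1, h2) o'clock (minute 1 = 12:00-12:01AM)
mins <- function(h1, h2) seq(h1 * 60 + 1, h2 * 60)
hi_win <- c(mins(8, 10), mins(15, 17), mins(22, 24))   ## periods with higher activity on PC6-positive days
lo_win <- c(mins(5, 7), mins(11, 13), mins(18, 20))    ## periods with lower activity on PC6-positive days

## per-day contrast in average log counts, then its SD across each subject's days
day_contrast  <- rowMeans(Act[, hi_win]) - rowMeans(Act[, lo_win])
surrogate_si6 <- tapply(day_contrast, Act_Analysis$SEQN, sd)

## correlation with the SD of the PC6 scores (approximately 0.87 in the published analysis)
cor(surrogate_si6,
    score_summ$PC6_s[match(as.numeric(names(surrogate_si6)), score_summ$SEQN)])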

Intuition behind fPCA

A major challenge with PC analysis is that it is not always intuitive and requires a degree of familiarity with matrix algebra and complex trajectories (functional data analysis, in statistics speak). While some of these problems are unavoidable given the complexity of the data, we will now provide the intuition needed for understanding both the PCs and the implication of our findings on the original data scale (daily minute-level activity profiles). We start by explaining the first 2 PCs, which are shown in the top-left panel of Figure 1.


The first PC (solid blue line) captures 22% of the overall variability in the observed daily profiles of activity. It has a distinct shape, with values starting negative between 12AM and 5AM, then becoming strongly positive with a peak around 8AM, slowly decreasing but staying positive until 9:30PM, and becoming negative after 9:30PM. This is exactly what we expected to see. For each subject, we have 3-7 days of valid activity data. Individuals with a positive score on this component (i.e., positively loaded on the first PC) on a given day will tend to have less activity during the night hours and more activity during the day hours than the average activity across all subject-days. The biggest difference between such a subject's day and the average daily activity across all subjects is centered on the morning hours (8AM-9AM). In contrast, the second PC (dashed line) captures 14% of the overall variability in the observed daily activity profiles. It starts positive between 12AM and 2AM, then becomes negative between 2AM and 11AM, with a negative peak at 8AM, increases between 11AM and 8PM, and decreases while staying positive after 8PM. Participants' days with positive scores on this component will be more active in the evening than the average individual's daily activity.

Now, let us investigate in detail the connection between PCs, scores, and individual daily trajectories. The right panel in Figure 1 displays one day of activity profiles for 3 subjects. Here, we plot the activity data smoothed for each subject and day using a thin-plate penalized spline with 30 knots, as implemented in the gam() function in the mgcv package in R. The individual days for subjects 21009 (red line) and 21913 (green line) have overall high activity, which is reflected by their high positive scores on PC1 (55.89 and 58.75, respectively). In contrast, the individual day's activity for subject 21039 (blue line) has lower levels, which is reflected by the negative score on the first PC (-21.12). Subjects 21009 (red line) and 21039 (blue line) are mostly active between 7AM and 9PM, and their corresponding scores on the second principal component are negative (-30.91 and -36.08, respectively). Subject 21913 (green line) is, however, unusually active during the night hours (8PM-2AM), with a highly positive score on the second PC (66.05).
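For illustration, a single subject-day can be smoothed in the same way as the curves shown in Figure 1; the sketch below uses an arbitrary row of Act_Analysis rather than the specific subjects plotted above.

## smooth one day of log(1 + activity counts) with a thin-plate penalized spline (30 knots)
one_day <- log(1 + unlist(Act_Analysis[1, paste0("MIN", 1:1440)]))
one_day[is.na(one_day)] <- 0          ## a single subject-day has missing minutes (see Step 4)
minute  <- 1:1440
fit_smooth <- gam(one_day ~ s(minute, bs = "tp", k = 30), method = "REML")

plot(minute/60, one_day, type = "l", col = "grey",
     xlab = "Hour of day", ylab = "log(1 + activity count)")
lines(minute/60, fitted(fit_smooth), lwd = 2)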

The last, but not least, important interpretation concerns the mean of the scores versus the standard deviation of the scores. In Figure 1, we have displayed three days, one for each subject. However, each subject has multiple days, and each day gets a score and a pattern. For example, on days when subject 21009 (red line) is less active, the score on PC1 will be lower, even if the general pattern stays the same. Thus, for every subject and PC we obtain a vector of scores; for subject 21009 (red line) we obtain (38.39, 55.89, 35.55, 60.61, 40.91, 48.67, -21.25) on the first PC, where we showed only the trajectory corresponding to the 3rd day. What we calculate is the mean of these scores, 36.97, and the standard deviation, 27.27. The mean of the scores is relatively easy to interpret, as it represents whether the average of the 7 days is higher or lower on PC1. The standard deviation captures the day-to-day variability of the individual. In this case the mean and standard deviation for subject 21009 (red line) were 36.97 and 27.27, respectively. In contrast, for subject 21039 (blue line), the mean and standard deviation were -19.16 and 5.02, respectively, much smaller than for subject 21009 (red line). This means that both the overall mean and the daily variability around this mean are larger for subject 21009 than for subject 21039 (blue line). This is depicted in Figure 2, where we show smoothed activity data for all days for subjects 21009 (Figure 2a, red lines) and 21039 (Figure 2b, blue lines). Note that the red lines tend to be higher than the blue lines and that the blue lines are less variable around their mean. We conclude that, in general, average PC1 scores will tend to distinguish between individuals who are, on average, less or more active over the days with available activity data. In contrast, average PC2 scores will distinguish between individuals who, on average, have high activity in the morning and low activity in the evening/night and individuals who, on average, have lower activity in the morning and higher activity during the evening/nighttime.

Of course, things are more complicated once we start interpreting every component. Instead, in Table 3 we provide just the interpretation of those components and summaries that were found to be predictive of the outcome.

Table 3: Interpretation of the results of fPCA

| Result | Interpretation | Surrogates |
|---|---|---|
| (-) association mi1 | Individuals with higher levels of overall activity during the day, and those who have higher early afternoon activity relative to early AM, are associated with later mortality | 1. Average TLAC |
| (+) association si6 | Individuals who are more variable in the start time of their daily activity are associated with earlier mortality | 1. Standard deviation of the ratio of mid-day to morning/afternoon activity; 2. Standard deviation of the difference in average activity during the peaks/troughs highlighted by PC6 |

Mortality prediction model

Once we’ve subset the data to obtain “data_analysis”, we calculate adjusted survey weights. These weights are calculated using the reweight_accel() function (see ?reweight_accel for help) which re-weights observed participants using age, gender, and ethnicity strata (among other things).
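A sketch of this step is shown below. The svydesign() call uses the standard NHANES design variables (SDMVPSU, SDMVSTRA); the exact arguments passed to reweight_accel() in the published analysis are in the .Rmd file.

## add re-calculated survey weights for the accelerometry analysis sample
data_analysis <- reweight_accel(data_analysis)

## survey design object used for all svyglm() fits below
data_analysis_svy_adj <- svydesign(ids     = ~SDMVPSU,
                                   strata  = ~SDMVSTRA,
                                   weights = ~wtmec4yr_adj_norm,
                                   data    = data_analysis,
                                   nest    = TRUE)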

Results

Participant characteristics by mortality status are detailed in Table 5. Predictors in Table 5 are ranked according to the AUC criterion from univariate logistic regression models, where each mortality prediction model was fitted with one covariate at a time. For example, TAC (total activity count) is the most powerful single predictor of 5-year mortality (AUC = 0.771), the next most predictive variable is Age (AUC = 0.758), and so on. We use the cross-validated, survey-weight-adjusted AUC (the auc_ij_adj column of the object auc_mat_1_adj) calculated at the first step of the forward selection process.
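As an illustration of the ranking step, the sketch below fits a single-covariate survey-weighted logistic regression and computes a survey-weight-adjusted AUC; wt_auc() is an illustrative helper (not a function from the packages above), and no cross-validation is shown, so the value will differ slightly from the cross-validated AUCs reported in Table 5.

## survey-weight-adjusted AUC: weighted concordance over all (decedent, survivor) pairs
wt_auc <- function(p, y, w) {
  pos <- which(y == 1); neg <- which(y == 0)
  wpair <- outer(w[pos], w[neg])
  conc  <- outer(p[pos], p[neg], function(a, b) (a > b) + 0.5 * (a == b))
  sum(wpair * conc) / sum(wpair)
}

## example: in-sample adjusted AUC for TAC alone
fit_tac <- svyglm(yr5_mort ~ TAC, design = data_analysis_svy_adj, family = quasibinomial())
wt_auc(p = fitted(fit_tac),
       y = data_analysis$yr5_mort,
       w = data_analysis$wtmec4yr_adj_norm)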

Table 5: Demographic and clinical characteristics ranked by their predictive ability in the 5-year mortality model.

| Rank | Characteristic | Total | Survivors | Decedents | AUC |
|---|---|---|---|---|---|
| 1 | TAC | 209786.1 (112911) | 217926 (111868.8) | 136307.9 (94316.3) | 0.771 |
| 2 | Age | 65.9 (9.6) | 65 (9.3) | 73.4 (8.9) | 0.758 |
| 3 | MVPA | 13.8 (17.1) | 14.7 (17.3) | 6.5 (12.1) | 0.745 |
| 4 | ASTP | 0.3 (0.09) | 0.29 (0.08) | 0.37 (0.11) | 0.733 |
| 5 | Sedentary time | 1110.6 (108.4) | 1102.4 (105.1) | 1184 (110.5) | 0.728 |
| 6 | TLAC | 2758.5 (727.2) | 2811.7 (705.6) | 2278.7 (744.9) | 0.721 |
| 7 | TLAC 12PM-2PM | 402.5 (117.3) | 410.2 (113.9) | 333.4 (125.3) | 0.697 |
| 8 | TLAC 4PM-6PM | 373.8 (114.7) | 381 (112.2) | 309.3 (116.4) | 0.697 |
| 9 | TLAC 2PM-4PM | 390.6 (119.3) | 398.1 (116.7) | 323 (121.5) | 0.693 |
| 10 | TLAC 6PM-8PM | 313.5 (119.3) | 320.5 (118.4) | 250.7 (108.7) | 0.692 |
| 11 | TLAC 10AM-12PM | 403.4 (129.8) | 410.9 (127.4) | 335.3 (132.3) | 0.682 |
| 12 | Mobility problem | 940 (31.6%) | 768 (28.6%) | 172 (57.9%) | 0.672 |
| 13 | SD on PC 6 (surrogate) | 0.69 (0.27) | 0.7 (0.27) | 0.57 (0.25) | 0.661 |
| 14 | SATP | 0.08 (0.02) | 0.08 (0.02) | 0.07 (0.02) | 0.658 |
| 15 | TLAC 8AM-10AM | 338.5 (154) | 344.7 (153.3) | 282.4 (149.3) | 0.629 |
| 16 | Education | | | | 0.612 |
| | Less than high school | 945 (31.7%) | 822 (30.7%) | 123 (41.4%) | |
| | High school | 739 (24.8%) | 659 (24.6%) | 80 (26.9%) | |
| | More than high school | 1294 (43.5%) | 1200 (44.8%) | 94 (31.6%) | |
| 17 | TLAC 8PM-10PM | 204.4 (121.5) | 208.8 (122.6) | 165.4 (104.1) | 0.603 |
| 18 | Drinking Status | | | | 0.602 |
| | Moderate Drinker | 1445 (48.5%) | 1346 (50.2%) | 99 (33.3%) | |
| | Non-Drinker | 1266 (42.5%) | 1106 (41.3%) | 160 (53.9%) | |
| | Heavy Drinker | 181 (6.1%) | 153 (5.7%) | 28 (9.4%) | |
| | Missing alcohol | 86 (2.9%) | 76 (2.8%) | 10 (3.4%) | |
| 19 | TLAC 6AM-8AM | 166.7 (151.2) | 171.1 (153.5) | 127 (121.9) | 0.59 |
| 20 | Smoking Status | | | | 0.586 |
| | Never | 1326 (44.5%) | 1233 (46%) | 93 (31.3%) | |
| | Former | 1147 (38.5%) | 1010 (37.7%) | 137 (46.1%) | |
| | Current | 505 (17%) | 438 (16.3%) | 67 (22.6%) | |
| 21 | CHF | 168 (5.6%) | 119 (4.4%) | 49 (16.5%) | 0.57 |
| 22 | Gender | | | | 0.559 |
| | Male | 1523 (51.1%) | 1331 (49.6%) | 192 (64.6%) | |
| | Female | 1455 (48.9%) | 1350 (50.4%) | 105 (35.4%) | |
| 23 | Diabetes | 518 (17.4%) | 444 (16.6%) | 74 (24.9%) | 0.558 |
| 24 | Cancer | 455 (15.3%) | 382 (14.2%) | 73 (24.6%) | 0.558 |
| 25 | BMI | | | | 0.553 |
| | Normal | 760 (25.5%) | 663 (24.7%) | 97 (32.7%) | |
| | Underweight | 29 (1%) | 22 (0.8%) | 7 (2.4%) | |
| | Overweight | 1150 (38.6%) | 1048 (39.1%) | 102 (34.3%) | |
| | Obese | 1039 (34.9%) | 948 (35.4%) | 91 (30.6%) | |
| 26 | CHD | 244 (8.2%) | 196 (7.3%) | 48 (16.2%) | 0.552 |
| 27 | Stroke | 174 (5.8%) | 132 (4.9%) | 42 (14.1%) | 0.546 |
| 28 | Race | | | | 0.525 |
| | White | 1756 (59%) | 1556 (58%) | 200 (67.3%) | |
| | Mexican American | 538 (18.1%) | 504 (18.8%) | 34 (11.4%) | |
| | Other Hispanic | 56 (1.9%) | 53 (2%) | 3 (1%) | |
| | Black | 534 (17.9%) | 481 (17.9%) | 53 (17.8%) | |
| | Other | 94 (3.2%) | 87 (3.2%) | 7 (2.4%) | |
| 29 | TLAC 12AM-2AM | 25 (60.1) | 25.1 (61.4) | 24.6 (47.7) | 0.511 |
| 30 | TLAC 10PM-12AM | 85.2 (98.7) | 86.3 (100.5) | 75.3 (80.6) | 0.506 |
| 31 | TLAC 4AM-6AM | 39.1 (83.2) | 39.6 (85.1) | 34.1 (64.1) | 0.504 |
| 32 | TLAC 2AM-4AM | 15.7 (52) | 15.4 (52.7) | 18.3 (44.9) | 0.5 |
| 33 | Wear time | 878.5 (138.5) | 877.1 (134.4) | 891.7 (170.8) | 0.454 |

Note: Data are presented as mean (standard deviation) or n (%).

Figure 3 displays the AIC, EPIC, and AUC at each stage of the forward selection procedure when using cross-validated AUC as the selection criterion. The scale for AIC and EPIC is shown on the right y-axis, while the scale for AUC is shown on the left y-axis. Accelerometry predictors are denoted in red font on the horizontal axis. In addition, the stopping point for the procedure is denoted by the large red dot. The blue and green dots correspond to the models with the lowest (best) EPIC and AIC values, respectively.

AUC for the 5-year mortality model

| Variable | Cross-Validated AUC |
|---|---|
| TAC | 0.771 |
| Age | 0.791 |
| Smoking Status | 0.800 |
| CHF | 0.811 |
| Drinking Status | 0.817 |
| ASTP | 0.823 |
| Mobility problem | 0.828 |
| Gender | 0.830 |
| SD on PC 6 (surrogate) | 0.833 |
| Diabetes | 0.836 |
| Education | 0.837 |
| TLAC 12AM-2AM | 0.838 |
| Stroke | 0.838 |

Table 6 presents the estimated final model coefficients with corresponding standard errors and significance values, accounting for the complex survey design of NHANES via the svyglm() function. The number of variables used in the model is selected according to the cross-validated AUC criterion. The "adjusted" weights we use for regression analyses are "wtmec4yr_adj_norm". These weights are calculated using the reweight_accel() function (see ?reweight_accel for help), which re-weights observed participants using age, gender, and ethnicity strata.

## Final model fits using the adjusted survey weights.
fit_final <- svyglm(as.formula(paste("yr5_mort ~", final_formula )),
                        design=data_analysis_svy_adj,
                        family=quasibinomial())
Table 6: Estimated final model coefficients with corresponding standard errors and significance values

| Variable | Estimate | p-value | OR Estimate | OR Lower CL | OR Upper CL |
|---|---|---|---|---|---|
| Intercept | -9.511 | 0.000 | 0.000 | 0.000 | 0.001 |
| TAC | 0.007 | 0.982 | 1.007 | 0.508 | 1.832 |
| Age | 0.083 | 0.000 | 1.087 | 1.063 | 1.112 |
| Former Smoker | 0.332 | 0.176 | 1.394 | 0.835 | 2.345 |
| Current Smoker | 0.797 | 0.002 | 2.219 | 1.412 | 3.478 |
| CHF: yes | 0.777 | 0.013 | 2.175 | 1.177 | 3.930 |
| Non-Drinker | 0.565 | 0.010 | 1.759 | 1.165 | 2.677 |
| Heavy Drinker | 0.963 | 0.018 | 2.620 | 1.148 | 5.673 |
| Missing Alcohol | 0.747 | 0.106 | 2.111 | 0.752 | 5.193 |
| ASTP | 0.382 | 0.016 | 1.465 | 1.078 | 1.993 |
| Mobility Problem | 0.546 | 0.028 | 1.726 | 1.057 | 2.816 |
| Gender: female | -0.648 | 0.007 | 0.523 | 0.332 | 0.817 |
| SD on PC 6 (surrogate) | -0.291 | 0.002 | 0.748 | 0.629 | 0.885 |
| Diabetes: yes | 0.216 | 0.310 | 1.241 | 0.780 | 1.937 |
| High school Education | -0.008 | 0.973 | 0.992 | 0.582 | 1.694 |
| More than high school Education | -0.231 | 0.309 | 0.794 | 0.489 | 1.294 |
| TLAC 12AM-2AM | 0.128 | 0.099 | 1.137 | 0.958 | 1.322 |
| Stroke: yes | 0.193 | 0.505 | 1.213 | 0.636 | 2.227 |

Comparison to the models without PA accelerometry-derived measures

Non-PA model:

# Non-PA predictors: Age, Smoking status, Congestive heart failure (CHF), Drinking status, Mobility problem, Gender, Diabetes, Education, and Stroke
noPAvars <- c("Age", "SmokeCigs", "CHF", "DrinkStatus", "MobilityProblem",
              "Gender", "Diabetes", "EducationAdult", "Stroke")
noPA_formula <- paste0(noPAvars,collapse="+")
fit_svy_noPA <- svyglm(as.formula(paste("yr5_mort ~", noPA_formula  )),design=data_analysis_svy_adj, family=quasibinomial())

AUC for the non-PA model: calculate 10-fold CV AUC for the no-PA model
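A sketch of the 10-fold cross-validation is given below; it reuses the illustrative wt_auc() helper defined above, the fold assignment and seed are arbitrary, and the exact CV scheme used for the published value is in the .Rmd file.

set.seed(1243)                                    ## illustrative seed
K     <- 10
folds <- sample(rep(1:K, length.out = nrow(data_analysis)))

cv_pred <- rep(NA_real_, nrow(data_analysis))
for (k in 1:K) {
  design_k <- subset(data_analysis_svy_adj, folds != k)      ## training-fold design
  fit_k    <- svyglm(as.formula(paste("yr5_mort ~", noPA_formula)),
                     design = design_k, family = quasibinomial())
  cv_pred[folds == k] <- as.numeric(predict(fit_k,
                                            newdata = data_analysis[folds == k, ],
                                            type = "response"))
}

wt_auc(p = cv_pred, y = data_analysis$yr5_mort, w = data_analysis$wtmec4yr_adj_norm)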

The cross-validated AUC for the non-PA model is 0.799.

Comparison with the best forward-selection no-PA model

Model selection results (cross-validated AUC criterion)

Cross-validated AUC of the forward-selection model without physical activity: the optimal 10-variable model achieves 0.803.

AUC for the model without physical activity predictors

| Variable | Cross-Validated AUC |
|---|---|
| Age | 0.740 |
| MobilityProblem | 0.767 |
| SmokeCigs | 0.781 |
| Gender | 0.788 |
| DrinkStatus | 0.795 |
| CHF | 0.799 |
| Diabetes | 0.801 |
| BMI_cat | 0.802 |
| Stroke | 0.803 |
| Cancer | 0.803 |

Subgroups of the population

Five-year mortality status (0 = alive, 1 = deceased) in the two age subgroups, ages 50-70 and ages 70 and above:

##    0    1 
## 1828   98
## 
##   0   1 
## 853 199
Univariate AUC rankings for the full sample and for the two age subgroups:

| All: Variable | All: AUC | Age 50-70: Variable | Age 50-70: AUC | Age 70 and above: Variable | Age 70 and above: AUC |
|---|---|---|---|---|---|
| TAC | 0.771 | TAC | 0.701 | ASTP | 0.711 |
| Age | 0.758 | MVPA | 0.700 | TAC | 0.706 |
| MVPA | 0.745 | MobilityProblem | 0.677 | TLAC_6 | 0.685 |
| ASTP | 0.733 | TLAC | 0.672 | ST | 0.679 |
| ST | 0.728 | ST | 0.661 | Age | 0.676 |
| TLAC | 0.721 | EducationAdult | 0.658 | TLAC_7 | 0.655 |
| TLAC_7 | 0.697 | TLAC_8 | 0.655 | TLAC | 0.654 |
| TLAC_9 | 0.697 | TLAC_10 | 0.654 | MVPA | 0.654 |
| TLAC_8 | 0.693 | TLAC_9 | 0.650 | sPC6 | 0.649 |
| TLAC_10 | 0.692 | TLAC_7 | 0.648 | TLAC_8 | 0.636 |
| TLAC_6 | 0.682 | Age | 0.647 | TLAC_10 | 0.634 |
| MobilityProblem | 0.672 | ASTP | 0.642 | TLAC_9 | 0.633 |
| sPC6 | 0.661 | SATP | 0.636 | TLAC_5 | 0.626 |
| SATP | 0.658 | DrinkStatus | 0.625 | MobilityProblem | 0.616 |
| TLAC_5 | 0.629 | SmokeCigs | 0.624 | SmokeCigs | 0.589 |
| EducationAdult | 0.612 | TLAC_6 | 0.621 | SATP | 0.588 |
| TLAC_11 | 0.603 | sPC6 | 0.620 | TLAC_2 | 0.581 |
| DrinkStatus | 0.602 | TLAC_4 | 0.614 | Gender | 0.581 |
| TLAC_4 | 0.590 | Diabetes | 0.605 | TLAC_3 | 0.569 |
| SmokeCigs | 0.586 | TLAC_11 | 0.593 | TLAC_11 | 0.561 |
| CHF | 0.570 | CHF | 0.580 | CHF | 0.553 |
| Gender | 0.559 | TLAC_5 | 0.577 | TLAC_1 | 0.548 |
| Diabetes | 0.558 | WT | 0.566 | DrinkStatus | 0.546 |
| Cancer | 0.558 | TLAC_1 | 0.558 | BMI_cat | 0.543 |
| BMI_cat | 0.553 | Gender | 0.553 | Stroke | 0.542 |
| CHD | 0.552 | CHD | 0.551 | WT | 0.540 |
| Stroke | 0.546 | Cancer | 0.549 | CHD | 0.533 |
| Race | 0.525 | Stroke | 0.544 | TLAC_4 | 0.525 |
| TLAC_1 | 0.511 | BMI_cat | 0.540 | Diabetes | 0.516 |
| TLAC_12 | 0.506 | TLAC_3 | 0.532 | Cancer | 0.516 |
| TLAC_3 | 0.504 | Race | 0.519 | TLAC_12 | 0.501 |
| TLAC_2 | 0.500 | TLAC_2 | 0.465 | EducationAdult | 0.496 |
| WT | 0.454 | TLAC_12 | 0.401 | Race | 0.493 |
## [1] "TAC+EducationAdult+CHF+TLAC_10+Cancer+CHD+Diabetes+DrinkStatus+TLAC_8+TLAC_4+SmokeCigs+MVPA+MobilityProblem+TLAC_2+WT+BMI_cat+Stroke"
## [1] "ASTP+Age+sPC6+SmokeCigs+Gender+CHF+TLAC_2"

Robustness of the final model results

Our analysis includes 20 variables derived directly from the accelerometry measurements. These variables may be highly correlated with one another and with age. The correlation plot of age and all activity-derived variables is presented below. Age has high negative correlations with TAC and TLAC and a positive correlation with sedentary time. TAC is highly correlated with most activity-derived measures, including MVPA, SATP, sedentary time, TLAC, and TLAC 4PM-6PM, 6PM-8PM, 2PM-4PM, 12PM-2PM, 10AM-12PM, 8AM-10AM, and 6AM-8AM. ASTP, sedentary time, TLAC, and SATP are among the top activity-derived measures that are highly correlated with multiple other variables. The surrogate for the standard deviation on the 6th PC has low correlation with the other activity-derived variables, which explains its contribution to the increase in AUC in the forward selection model beyond the conventional activity measures.
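A sketch of the correlation plot is shown below; the variable list mirrors the summaries derived above (sPC6 denotes the surrogate for the standard deviation on PC 6), and the plotting options used for the published figure may differ.

## correlation matrix of age and the 20 accelerometry-derived variables
corr_vars <- c("Age", "TAC", "TLAC", "WT", "ST", "MVPA", "SATP", "ASTP",
               paste0("TLAC_", 1:12), "sPC6")
corr_mat  <- cor(data_analysis[, corr_vars], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", type = "upper", tl.col = "black", tl.cex = 0.7)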

Due to the high correlation among different activity summaries, many of these variables contain similar information and often replace each other in the final mortality prediction model. Thus, in practice, we suggest examining a smaller subset of activity-derived variables, including: TLAC, sedentary and wear time, MVPA, SATP, ASTP, and the surrogate for the standard deviation of the sixth PC (SD on PC 6).


The 5-year mortality prediction model reported in this analysis used cross-validated AUC as the forward selection criterion. We examined the robustness of our results to using the EPIC (Figure 4) and AIC (Figure 5) criteria to enter a new variable into the model. While there are differences in the order in which the variables enter the model, all 3 criteria select mostly the same top 8 variables for the final model. The notable exception is that neither AIC nor EPIC selects TAC into the model. This is likely due to the fact that age and TAC have similar explanatory power and both EPIC and AIC select age first. Then, once age is included in the model, other accelerometry measures explain more of the variability left unexplained by age than TAC does.

References

  1. Lumley T, Scott A. AIC and BIC for modeling with complex survey data. Journal of Survey Statistics and Methodology. 2015;3(1):1-18. https://doi.org/10.1093/jssam/smu021

  2. Leroux A. rnhanesdata: NHANES Accelerometry Data Pipeline. R package version 1.0. URL: https://github.com/andrew-leroux/rnhanesdata