We assess the prediction performance of objective physical activity measures and their ranking relative to other established predictors of 5-year all-cause mortality in the US. In this analysis, we use participants between 50 and 85 years old from the 2003-2004 and 2005-2006 samples of the National Health and Nutrition Examination Survey (NHANES, n = 2978, number of deaths within 5 years = 297) who wore a hip-worn accelerometer in the free-living environment for up to 7 days.

Note: the HTML file displays only the partial R code necessary to illustrate the functions and arguments used to execute the analysis. Full code is available in the corresponding .Rmd file.

Prerequisites

The following packages are used in this vignette to illustrate the NHANES 5-year mortality prediction model. Note that we need to use R's old default random number generator to reproduce the published results.

library("knitr")
library("kableExtra")
library("devtools")
library("ggplot2")
library("gridExtra")
library("corrplot")
library("reshape2")
library("magrittr")
library("plyr")
library("dplyr")
library("survey")
library("mgcv")
library("refund")
library("rnhanesdata")
RNGkind(sample.kind="Rounding")

NHANES 2003-2004 and 2005-2006 5-year mortality model data selection

We start by downloading the NHANES 2003-2004 and 2005-2006 cohorts' information, processing the data, and combining survey weights for the two cohorts using the R package rnhanesdata (Leroux, 2018). Items 1-11 illustrate the processing steps we performed to obtain the subset of NHANES data that meets the analysis criteria and to derive activity summaries. These steps produce an R object called "data_analysis", which is used in the 5-year mortality prediction model. **For simplicity of presentation, we suppress the code printout at each step; this can be changed by setting echo = TRUE in the R chunk options (e.g., {r, echo = TRUE}). Full code is available in the vignette_prediction.Rmd file.**

  1. Download and process NHANES lab measurements from the 2 cohorts: systolic blood pressure readings, total cholesterol, and HDL cholesterol. These data are saved in a temporary directory, which is then removed once the covariate information is processed. Blood pressure is recorded up to 4 times for each participant. The data collection procedure and descriptions of the cholesterol and blood pressure variables are available at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2003 for the 2003-2004 wave and at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2005 for the 2005-2006 wave.
# Create a (local) temporary directory 
# where lab measurement (cholesterol, blood pressure) data will be downloaded from the CDC website 
# and then loaded into R. These files need to be downloaded separately as 
# the raw files associated with these lab measurements are not included in the rnhanesdata package.
dir_tmp = tempfile()

if (!dir.exists(dir_tmp)){
  dir.create(dir_tmp, showWarnings = FALSE)
}
dl_file = function(url) {
  bn = basename(url) 
  destfile = file.path(dir_tmp, bn)
  if (!file.exists(destfile)) {
    out = download.file(url, destfile = destfile, mode="wb")
  }
  stopifnot(file.exists(destfile))
}
## download the lab measurement data for the cohort 2003-2004
# Cholesterol - Total & HDL: LBXTC and LBXHDD
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/L13_C.XPT")
# Blood Pressure: BPXSY1 , BPXSY2, BPXSY3 and BPXSY4
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/BPX_C.XPT")

## download the lab measurement data for the cohort 2005-2006
# Total Cholesterol: LBXTC
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/TCHOL_D.XPT")
# HDL Cholesterol: LBDHDD
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/HDL_D.XPT")
# Blood Pressure, up to 4 measurements per person: BPXSY1 , BPXSY2, BPXSY3 and BPXSY4
dl_file("https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/BPX_D.XPT")


varnames <- c("LBXTC","LBXHDD","LBDHDD",           ## 1. cholesterol. Note LBXHDD and LBDHDD are the same variable, 
                                                   ##    but different names for 2003-2004 and 2005-2006 cohorts
              "BPXSY1","BPXSY2","BPXSY3", "BPXSY4" ## 2. blood pressure measurements
              )


## load and merge the lab data
lab_data <- process_covar(varnames = varnames,localpath=dir_tmp)

## change column name for cholesterol variable that changed names
colnames(lab_data$Covariate_C)[colnames(lab_data$Covariate_C) == "LBXHDD"] <- "LBDHDD"

## combine waves
CVMarkers <- bind_rows(lab_data$Covariate_C, lab_data$Covariate_D)

rm(list=c("lab_data","dir_tmp","varnames"))
  2. Load minute-level activity data and combine it with the lab measurements, survey sampling data, and mortality data included in the rnhanesdata package.
## load the data
data("PAXINTEN_C");data("PAXINTEN_D")
data("Flags_C");data("Flags_D")
data("Mortality_2015_C");data("Mortality_2015_D")
data("Covariate_C");data("Covariate_D")

## re-code activity counts which are considered "non-wear" to be 0
## this doesn't impact many data points, most estimated non-wear times correspond to 0 counts
col_vars = paste0("MIN",1:1440)
PAXINTEN_C[, col_vars] <- PAXINTEN_C[, col_vars]*Flags_C[, col_vars]
PAXINTEN_D[, col_vars] <- PAXINTEN_D[, col_vars]*Flags_D[, col_vars]

## Merge covariate, mortality, and accelerometry data
## note that both PAXINTEN_* and Covariate_* have a column
## called "SDDSRVYR" indicating which NHANES wave the data is associated with.
## To avoid duplicating this column in the merged data, we add this variable to the "by"
## argument in left_join()
AllAct_C <- left_join(PAXINTEN_C, Mortality_2015_C, by = "SEQN") %>%
        left_join(Covariate_C, by=c("SEQN", "SDDSRVYR"))
AllAct_D <- left_join(PAXINTEN_D, Mortality_2015_D, by = "SEQN") %>%
        left_join(Covariate_D, by=c("SEQN", "SDDSRVYR"))

AllFlags_C <- left_join(Flags_C, Mortality_2015_C, by = "SEQN") %>%
        left_join(Covariate_C, by=c("SEQN", "SDDSRVYR"))
AllFlags_D <- left_join(Flags_D, Mortality_2015_D, by = "SEQN") %>%
        left_join(Covariate_D, by=c("SEQN", "SDDSRVYR"))

## clean up the workspace for memory purposes
rm(list=c(paste0(c("PAXINTEN_", "Covariate_","Mortality_2015_","Flags_"),rep(LETTERS[3:4],each=4))))
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   2097600  112.1    4267247  227.9         NA   3290721  175.8
## Vcells 155701530 1188.0  400283835 3054.0      16384 342513909 2613.2
## combine data for the two waves
AllAct   <- bind_rows(AllAct_C,AllAct_D)
rm(list = c("AllAct_C","AllAct_D"))
AllFlags <- bind_rows(AllFlags_C,AllFlags_D)
rm(list = c("AllFlags_C","AllFlags_D"))

#merge with cardiovascular markers 
AllAct <- left_join(AllAct, CVMarkers, by = "SEQN")
AllFlags <- left_join(AllFlags, CVMarkers, by = "SEQN")

## clean up the workspace again
rm(list = "CVMarkers");
gc()
##             used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   2095065  111.9    4267247  227.9         NA   4267247  227.9
## Vcells 156798177 1196.3  400283835 3054.0      16384 400258277 3053.8
  3. Create new factor covariates from the NHANES questionnaire data, which will be used in the prediction model. In addition, calculate average systolic blood pressure using up to 4 measurements per participant.
## Code 5-year mortality; NA for individuals who are alive with less than 5 years of follow-up
AllAct$yr5_mort <- AllFlags$yr5_mort <- as.integer(
  ifelse(AllAct$permth_exm/12 <= 5 & AllAct$mortstat == 1, 1,
         ifelse(AllAct$permth_exm/12 < 5 & AllAct$mortstat == 0, NA, 0))
                                                    )
## Create Age in years using the age at examination (i.e. when participants wore the device)
AllAct$Age <- AllFlags$Age <- AllAct$RIDAGEEX/12


## Re-level comorbidities to assign refused/don't know as not having the condition
## Note that in practice this does not affect many individuals, but it is an assumption we're making.
levels(AllAct$CHD)    <- levels(AllFlags$CHD)    <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$CHF)    <- levels(AllFlags$CHF)    <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Stroke) <- levels(AllFlags$Stroke) <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Cancer) <- levels(AllFlags$Cancer) <- list("No" = c("No","Refused","Don't know"), "Yes" = c("Yes"))
levels(AllAct$Diabetes) <- levels(AllFlags$Diabetes) <- list("No" = c("No","Borderline", "Refused","Don't know"), "Yes" = c("Yes"))


## Re-level education to have 3 levels and categorize don't know/refused to be missing
levels(AllAct$EducationAdult) <- levels(AllFlags$EducationAdult) <- list("Less than high school" = c("Less than 9th grade", "9-11th grade"),
                                                                         "High school" = c("High school grad/GED or equivalent"),
                                                                         "More than high school" = c("Some College or AA degree", "College graduate or above"))

## Re-level alcohol consumption to include a level for "missing"
levels(AllAct$DrinkStatus) <- levels(AllFlags$DrinkStatus) <- c(levels(AllAct$DrinkStatus), "Missing alcohol")
AllAct$DrinkStatus[is.na(AllAct$DrinkStatus)] <- AllFlags$DrinkStatus[is.na(AllAct$DrinkStatus)] <- "Missing alcohol"


# systolic blood pressure calculation 
AllAct$SYS <- AllFlags$SYS <-  round( rowMeans(AllAct[,c("BPXSY1","BPXSY2","BPXSY3", "BPXSY4")], na.rm = TRUE))

## Re-order columns so that activity and wear/non-wear flags are the last 1440 columns of our two
## data matrices. This is a personal preference and is absolutely not necessary.
act_cols <- which(colnames(AllAct) %in% paste0("MIN",1:1440))
oth_cols <- which(!colnames(AllAct) %in% paste0("MIN",1:1440))
AllAct   <- AllAct[,c(oth_cols,act_cols)]
AllFlags <- AllFlags[,c(oth_cols,act_cols)]
rm(list=c("act_cols","oth_cols"))
  4. Calculate daily activity summary measures: total activity count (TAC), total log activity count (TLAC), total accelerometer wear time (WT), total minutes of moderate/vigorous physical activity (MVPA), sedentary/sleep/non-wear to active transition probability (SATP\(_{sl/nw}\)), and active to sedentary/sleep/non-wear transition probability (ASTP\(_{sl/nw}\)). In addition, compute total log activity count summary measures (TLAC_1, TLAC_2, ..., TLAC_12) in each 2-hr window, i.e. 12AM-2AM, 2AM-4AM, 4AM-6AM, etc. Note, there is one individual with 501 minutes recorded as NA. These missing values occur during the last 501 minutes of the final day this participant wore the device. We impute these missing data with 0.
## Assign just the activity and wear/non-wear flag data to matrices.
## This makes computing the features faster and is needed for the row-wise bout calculations below.
act_mat  <- AllAct[,paste0("MIN",1:1440)]
act_mat = as.matrix(act_mat)

## replace NAs with 0s
for (icol in seq(ncol(act_mat))) {
  act_mat[, icol][is.na(act_mat[, icol])]   <- 0
}

AllAct$TAC   <- AllFlags$TAC   <- rowSums(act_mat)
AllAct$TLAC  <- AllFlags$TLAC  <- rowSums(log(1 + act_mat))
AllAct$ST    <- AllFlags$ST    <- rowSums(act_mat < 100)
AllAct$MVPA  <- AllFlags$MVPA  <- rowSums(act_mat >= 2020)

# compute total log activity count in each 2-hr window, 
# 2 hour (120 minutes) binning window 
tlen <- 120
nt   <- floor(1440/tlen)
# create a list of indices for binning into 2-hour windows
inx_col_ls <- split(1:1440, rep(1:nt,each=tlen))
Act_2hr    <- sapply(inx_col_ls, function(x) rowSums(log(1+act_mat[,x,drop=FALSE])))
colnames(Act_2hr) <- paste0("TLAC_",c(1:12))
rm(list=c("tlen","nt","inx_col_ls"))
act_mat = act_mat >= 100
## calculate fragmentation measures
get_sed_act = function(x){
                        mat <- rle(x)
                        rm(x)
                        sed <- mat$lengths[!mat$values]
                        act <- mat$lengths[mat$values]

                        sed <- ifelse(length(sed) == 0, NA, mean(sed))
                        act <- ifelse(length(act) == 0, NA, mean(act))
                        c(sed, act)
}
bout_mat = matrix(nrow = 2, ncol = nrow(act_mat))
for (i in seq(nrow(act_mat))) {
  bout_mat[, i] = get_sed_act(act_mat[i, ])
}
rm(act_mat)
AllAct$SBout <- AllFlags$SBout <- bout_mat[1,]
AllAct$ABout <- AllFlags$ABout <- bout_mat[2,]
AllAct$SATP  <- AllFlags$SATP  <- 1/AllAct$SBout
AllAct$ASTP  <- AllFlags$ASTP  <- 1/AllAct$ABout

rm(bout_mat)
AllAct$WT    <- AllFlags$WT    <- rowSums(AllFlags[,paste0("MIN",1:1440)], na.rm = TRUE)
AllAct   <- cbind(AllAct, Act_2hr)
AllFlags <- cbind(AllFlags, Act_2hr)

rm(list = c("Act_2hr"))
  5. Exclude participants who were: 1) younger than 50 years old, or 85 and older, at the time they wore the accelerometer (10,859 participants); 2) missing BMI or education predictor variables (41 participants); 3) had fewer than 3 days of data with at least 10 hours of estimated wear time (517 participants); 4) missing mortality information (16 participants); 5) alive with follow-up of less than 5 years (17 participants); 6) missing systolic blood pressure, total cholesterol, or HDL cholesterol measurements (293 participants). Table 1 below displays the number of participants with missing data for all pairwise combinations of variables with missing data.
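Although the exclusion code is suppressed in the rendered output, a minimal sketch of these steps is given below. It assumes the day-level AllAct data frame created above and uses only the wear-time definition of a "good" day stated in the text; the exact (sequential) exclusion code used for the published counts is in vignette_prediction.Rmd.

## Sketch of the exclusion criteria (assumes AllAct from the steps above).
## The published analysis applies these sequentially; here they are applied jointly.
keep <- AllAct$Age >= 50 & AllAct$Age < 85                             ## 1) age range
keep <- keep & !is.na(AllAct$BMI_cat) & !is.na(AllAct$EducationAdult)  ## 2) BMI and education
## 3) at least 3 days with >= 10 hours (600 minutes) of estimated wear time
AllAct$goodday <- as.integer(AllAct$WT >= 600)
keep <- keep & ave(AllAct$goodday, AllAct$SEQN, FUN = sum) >= 3
## 4) + 5) non-missing mortality information and not alive with < 5 years of follow-up
keep <- keep & !is.na(AllAct$yr5_mort)
## 6) non-missing systolic blood pressure, total cholesterol, and HDL cholesterol
keep <- keep & !is.na(AllAct$SYS) & !is.na(AllAct$LBXTC) & !is.na(AllAct$LBDHDD)

## day-level data for the analysis sample, "good" days only
Act_Analysis <- AllAct[which(keep & AllAct$goodday == 1), ]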
Table 1: Missing data patterns

|                | BMI | Education | Bad Accel Data | Mortality | Follow-up | Lab |
|----------------|-----|-----------|----------------|-----------|-----------|-----|
| BMI            | 35  | 0         | 8              | 1         | 0         | 7   |
| Education      | 0   | 6         | 2              | 0         | 0         | 1   |
| Bad Accel Data | 8   | 2         | 517            | 4         | 0         | 57  |
| Mortality      | 1   | 0         | 4              | 21        | 0         | 1   |
| Follow-up      | 0   | 0         | 0              | 0         | 0         | 0   |
| Lab            | 7   | 1         | 57             | 1         | 0         | 293 |
  6. Create our dataset for analysis, "data_analysis", with one row per subject, containing only those subjects who meet our inclusion criteria.

Thus, after the exclusion criteria were applied, the data set contained 2978 subjects. Among these participants, 86 had missing alcohol consumption information; for these participants we introduced the category "Missing alcohol" and retained them in the dataset.

Maximum follow-up time in this cohort in years

## [1] 13.08333
  7. Calculate subject-specific averages of the accelerometry features using only the "good" days of data.
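The code for this step is suppressed; a brief sketch, assuming the day-level Act_Analysis object sketched above (good days only), is:

## average each daily summary measure across a subject's good days
act_vars <- c("TAC", "TLAC", "WT", "ST", "MVPA", "SATP", "ASTP", paste0("TLAC_", 1:12))
subj_act <- Act_Analysis %>%
        group_by(SEQN) %>%
        summarize_at(act_vars, mean, na.rm = TRUE) %>%
        ungroup()
## subj_act (one row per subject) is then merged with the covariate data to build data_analysis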

  8. Get the log(count + 1) activity data in matrix format and conduct functional principal component analysis (fPCA) using the fpca.face() function (with 50 knots) from the refund package in R.

Act <- log(1 + Act_Analysis[,paste0("MIN",1:1440)])
Act[is.na(Act)] <- 0
Act = as.matrix(Act)
fpca_fit <- fpca.face(Act,knots=50)
  9. Retain the first 6 PCs, which explain approximately 57% of the variability in the daily activity data (the space spanned by the rows of the matrix of non-smoothed, log-transformed activity counts), obtain the score for each day on each PC, and calculate the mean and standard deviation of these scores for each subject across days.
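The code for this step is suppressed. A sketch of how the explained variability and the per-subject score summaries could be obtained from the fpca.face() fit is shown below; note that the eigenvalue-based proportion of variance refers to the smoothed representation and may differ slightly from the 57% quoted above, which is computed relative to the non-smoothed log counts. The column names PC1_m, ..., PC6_s are illustrative and correspond to \(m_{i1}, \dots, s_{i6}\) in the text.

## proportion of variability explained by the leading PCs (of the smoothed fit)
round(cumsum(fpca_fit$evalues) / sum(fpca_fit$evalues), 3)

## per-day scores on the first 6 PCs; rows of Act correspond to rows of Act_Analysis
scores_6 <- as.data.frame(fpca_fit$scores[, 1:6])
colnames(scores_6) <- paste0("PC", 1:6)
scores_6$SEQN <- Act_Analysis$SEQN

## subject-level mean (m) and standard deviation (s) of the scores across days
score_summ <- scores_6 %>%
        group_by(SEQN) %>%
        summarize_at(paste0("PC", 1:6), list(m = mean, s = sd)) %>%
        ungroup()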

  10. Identify the PC-specific means and standard deviations that are significantly associated with 5-year mortality using a backward-selection logistic regression model fitted with the svyglm() function in R. In this step, the demographic, behavioral, and comorbidity variables were included in all models, and the selection was conducted on the means and standard deviations of the scores on the 6 PCs (a total of 12 variables) using the complex-survey AIC criterion (Lumley and Scott, 2015).

Note, there are 86 participants with missing alcohol information, which was coded as an additional level of the DrinkStatus variable.
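The backward-selection code is suppressed in the rendered output. A sketch is given below; it assumes the per-subject score summaries (PC1_m, ..., PC6_s) from the sketch above have been merged into data_analysis, and that a survey design object analogous to data_analysis_svy_adj (constructed in the "Mortality prediction model" section below) is available. The always-retained covariates mirror those shown in Table 2. If AIC() is not available for quasibinomial svyglm fits in your version of the survey package, this selection step can instead be run with family = binomial().

## candidate variables for removal: means and SDs of the scores on the first 6 PCs
pc_vars   <- c(paste0("PC", 1:6, "_m"), paste0("PC", 1:6, "_s"))
## covariates kept in every model (cf. Table 2)
base_vars <- c("Age", "SmokeCigs", "DrinkStatus", "BMI_cat", "Diabetes", "CHF", "CHD",
               "Stroke", "Cancer", "MobilityProblem", "LBXTC", "LBDHDD", "SYS")

current <- pc_vars
repeat {
  fit_full <- svyglm(as.formula(paste("yr5_mort ~", paste(c(base_vars, current), collapse = "+"))),
                     design = data_analysis_svy_adj, family = quasibinomial())
  aic_full <- AIC(fit_full)[["AIC"]]   ## complex-survey AIC (Lumley and Scott, 2015)

  ## complex-survey AIC after dropping each candidate in turn
  aic_drop <- sapply(current, function(v) {
    fit_v <- svyglm(as.formula(paste("yr5_mort ~",
                                     paste(c(base_vars, setdiff(current, v)), collapse = "+"))),
                    design = data_analysis_svy_adj, family = quasibinomial())
    AIC(fit_v)[["AIC"]]
  })

  if (min(aic_drop) >= aic_full) break                    ## no single drop improves AIC: stop
  current <- setdiff(current, names(which.min(aic_drop)))
}
current   ## PC-score summaries retained (mi1 and si6 in the published analysis)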

Table 2: Final backward selection model estimated coefficients

| Variable | Estimate | p-value | OR Estimate | OR Lower CL | OR Upper CL |
|---|---|---|---|---|---|
| Intercept | -7.590 | 0.000 | 0.001 | 0.000 | 0.004 |
| Age | 0.080 | 0.000 | 1.083 | 1.059 | 1.108 |
| Former Smoker | 0.487 | 0.021 | 1.627 | 1.073 | 2.485 |
| Current Smoker | 0.941 | 0.000 | 2.563 | 1.797 | 3.648 |
| Non-Drinker | 0.567 | 0.000 | 1.763 | 1.279 | 2.440 |
| Heavy Drinker | 1.053 | 0.002 | 2.866 | 1.423 | 5.576 |
| Missing Alcohol | 0.816 | 0.065 | 2.261 | 0.875 | 5.197 |
| Underweight | 0.597 | 0.384 | 1.817 | 0.400 | 6.520 |
| Overweight | -0.554 | 0.005 | 0.575 | 0.386 | 0.854 |
| Obese | -0.509 | 0.000 | 0.601 | 0.478 | 0.755 |
| Diabetes: yes | 0.286 | 0.161 | 1.331 | 0.879 | 1.989 |
| CHF: yes | 0.833 | 0.001 | 2.300 | 1.356 | 3.841 |
| CHD: yes | 0.113 | 0.671 | 1.120 | 0.649 | 1.879 |
| Stroke: yes | 0.287 | 0.299 | 1.332 | 0.756 | 2.279 |
| Cancer: yes | 0.273 | 0.100 | 1.314 | 0.939 | 1.823 |
| Mobility Problem | 0.620 | 0.001 | 1.859 | 1.270 | 2.719 |
| Total Cholesterol | -0.003 | 0.243 | 0.997 | 0.992 | 1.002 |
| HDL Cholesterol | -0.001 | 0.840 | 0.999 | 0.989 | 1.008 |
| Systolic Blood Pressure | 0.000 | 0.949 | 1.000 | 0.993 | 1.008 |
| mi1 | 0.014 | 0.008 | 1.014 | 1.004 | 1.026 |
| si6 | -0.077 | 0.000 | 0.926 | 0.888 | 0.965 |
  11. Replace the significant PC-based measures with corresponding surrogate variables. This step is done to improve the interpretability and comparability (across studies) of the results. For those unfamiliar with principal component analysis, reading the section "Intuition behind fPCA" below may be helpful in understanding this step further.

Identifying potential surrogate measures is based on (subjective) interpretations of the shapes of the principal components. Fundamentally, the idea is to use visual inspection of the principal components to identify the "dominant" features of each component. Then, we return to the original data and calculate a statistic which we believe captures this dominant feature. For example, looking at the upper-left panel of the principal component plot above, we see that days which load negatively on the first principal component tend to be extremely active, while those that load positively tend to be very inactive. As a result, one reasonable "guess" at a surrogate measure which is highly associated with the first component score is simply the total log-transformed activity count (TLAC) for that day. If that is true, we would also expect the average PC1 score within subjects to be highly correlated with their average TLAC across days. In our data, average TLAC and average PC1 score are highly (negatively) correlated (\(\hat{\rho}\) = -0.87), which is expected based on the sign of the first PC.

This procedure would then be repeated for each feature identified as potentially predictive. In our application, we are also interested in the standard deviation of PC6. Looking at the bottom-right panel of the plot above, we see that there are 6 periods where the contrast between days with positive and negative loadings is highest (i.e. where the difference between the \(+\) and \(-\) curves is largest). One reasonable guess for a statistic which is highly correlated with the PC6 score is the difference in average activity during the specific time periods where the positive/negative loadings are high/low, respectively. For example, days that load highly on PC6 should, on average, have higher activity during the mid morning (8AM-10AM), late afternoon (3PM-5PM), and late evening (10PM-12AM) and lower activity during the early morning (5AM-7AM), late morning/early afternoon (11AM-1PM), and early evening (6PM-8PM). Since we are interested in the standard deviation of the PC6 score, we calculate the standard deviation of the difference in average log-transformed activity counts during these periods as a surrogate measure for \(s_{i6}\). In our analysis, we used all 6 of these time periods and obtained an observed correlation of \(\hat{\rho}\) = 0.87, though multiple choices can be explored to see which statistic has the highest correlation with \(s_{i6}\).
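A sketch of how such a surrogate could be constructed and checked is shown below; it assumes the minute-level matrix Act and the Act_Analysis and score_summ objects from the sketches above, and the window choices simply follow the description in the previous paragraph.

## minutes covering the interval [h1, h2) o'clock (minute 1 = 12:00-12:01AM)
mins <- function(h1, h2) seq(h1 * 60 + 1, h2 * 60)
hi_win <- c(mins(8, 10), mins(15, 17), mins(22, 24))   ## periods with higher activity on PC6-positive days
lo_win <- c(mins(5, 7), mins(11, 13), mins(18, 20))    ## periods with lower activity on PC6-positive days

## per-day contrast in average log counts, then its SD across each subject's days
day_contrast  <- rowMeans(Act[, hi_win]) - rowMeans(Act[, lo_win])
surrogate_si6 <- tapply(day_contrast, Act_Analysis$SEQN, sd)

## correlation with the SD of the PC6 scores (approximately 0.87 in the published analysis)
cor(surrogate_si6,
    score_summ$PC6_s[match(as.numeric(names(surrogate_si6)), score_summ$SEQN)])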

Intuition behind fPCA

A major challenge with PC analysis is that it is not always intuitive and requires a degree of familiarity with matrix algebra and complex trajectories (functional data analysis, in statistics speak). While some of these problems are unavoidable given the complexity of the data, we will now provide the intuition needed for understanding both the PCs and the implication of our findings on the original data scale (daily minute-level activity profiles). We start by explaining the first 2 PCs, which are shown in the top-left panel of Figure 1.


The first PC (solid blue line) captures 22% of the overall variability in the observed daily profiles of activity. It has a distinct shape, with values starting negative between 12AM and 5AM, then becoming strongly positive with a peak around 8AM, slowly decreasing but staying positive until 9:30PM, and becoming negative after 9:30PM. This is exactly what we expected to see. For each subject, we have 3-7 days of valid activity data. Individuals with a positive score on this component (i.e., positively loaded on the first PC) on a given day will tend to have less activity during the night hours and more activity during the day hours than the average activity across all subject-days. The biggest difference between such a subject's day and the average daily activity across all subjects is centered on the morning hours (8AM-9AM). In contrast, the second PC (dashed line) captures 14% of the overall variability in the observed daily activity profiles. It starts positive between 12AM and 2AM, then becomes negative between 2AM and 11AM, with a negative peak at 8AM, increases between 11AM and 8PM, and decreases while staying positive after 8PM. Participants' days with positive scores on this component will be more active in the evening than the average individual's daily activity.

Now, let us investigate in detail the connection between PCs, scores, and individual daily trajectories. The right panel in Figure 1 displays one day of activity profiles for 3 subjects. Here, we plot the activity data smoothed for each subject and day using a thin-plate penalized spline with 30 knots, as implemented in the gam() function in the mgcv package in R. The individual days for subjects 21009 (red line) and 21913 (green line) have overall high activity, which is reflected by their high positive scores on PC1 (55.89 and 58.75, respectively). In contrast, the individual day's activity for subject 21039 (blue line) has lower levels, which is reflected by the negative score on the first PC (-21.12). Subjects 21009 (red line) and 21039 (blue line) are mostly active between 7AM and 9PM, and their corresponding scores on the second principal component are negative (-30.91 and -36.08, respectively). Subject 21913 (green line) is, however, unusually active during the night hours (8PM-2AM), with a highly positive score on the second PC (66.05).
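For illustration, a single subject-day can be smoothed in the same way as the curves shown in Figure 1; the sketch below uses an arbitrary row of Act_Analysis rather than the specific subjects plotted above.

## smooth one day of log(1 + activity counts) with a thin-plate penalized spline (30 knots)
one_day <- log(1 + unlist(Act_Analysis[1, paste0("MIN", 1:1440)]))
one_day[is.na(one_day)] <- 0          ## a single subject-day has missing minutes (see Step 4)
minute  <- 1:1440
fit_smooth <- gam(one_day ~ s(minute, bs = "tp", k = 30), method = "REML")

plot(minute/60, one_day, type = "l", col = "grey",
     xlab = "Hour of day", ylab = "log(1 + activity count)")
lines(minute/60, fitted(fit_smooth), lwd = 2)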

The last, but not least, important interpretation concerns the mean of the scores versus the standard deviation of the scores. In Figure 1, we have displayed three days, one for each subject. However, each subject has multiple days, and each day gets a score and a pattern. For example, on days when subject 21009 (red line) is less active, the score on PC1 will be lower, even if the general pattern stays the same. Thus, for every subject and PC we obtain a vector of scores; for subject 21009 (red line) we obtain (38.39, 55.89, 35.55, 60.61, 40.91, 48.67, -21.25) on the first PC, where we showed only the trajectory corresponding to the 3rd day. What we calculate is the mean of these scores, 36.97, and the standard deviation, 27.27. The mean of the scores is relatively easy to interpret, as it represents whether the average of the 7 days is higher or lower on PC1. The standard deviation captures the day-to-day variability of the individual. In this case the mean and standard deviation for subject 21009 (red line) were 36.97 and 27.27, respectively. In contrast, for subject 21039 (blue line), the mean and standard deviation were -19.16 and 5.02, respectively, much smaller than for subject 21009 (red line). This means that both the overall mean and the daily variability around this mean are larger for subject 21009 than for subject 21039 (blue line). This is depicted in Figure 2, where we show smoothed activity data for all days for subjects 21009 (Figure 2a, red lines) and 21039 (Figure 2b, blue lines). Note that the red lines tend to be higher than the blue lines and that the blue lines are less variable around their mean. We conclude that, in general, average PC1 scores will tend to distinguish between individuals who are, on average, less or more active over the days with available activity data. In contrast, average PC2 scores will distinguish between individuals who, on average, have high activity in the morning and low activity in the evening/night and individuals who, on average, have lower activity in the morning and higher activity during the evening/nighttime.

Of course, things are more complicated once we start interpreting every component. Instead, in Table 3 we provide just the interpretation of those components and summaries that were found to be predictive of the outcome.

Table 3: Interpretation of the results of fPCA

| Result | Interpretation | Surrogates |
|---|---|---|
| (-) association mi1 | Individuals with higher levels of overall activity during the day, and those who have higher early afternoon activity relative to early AM, are associated with later mortality | 1. Average TLAC |
| (+) association si6 | Individuals who are more variable in the start time of their daily activity are associated with earlier mortality | 1. Standard deviation of the ratio of mid-day to morning/afternoon activity; 2. Standard deviation of the difference in average activity during the peaks/troughs highlighted by PC6 |

Mortality prediction model

Once we’ve subset the data to obtain “data_analysis”, we calculate adjusted survey weights. These weights are calculated using the reweight_accel() function (see ?reweight_accel for help) which re-weights observed participants using age, gender, and ethnicity strata (among other things).
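A sketch of this step is shown below. The svydesign() call uses the standard NHANES design variables (SDMVPSU, SDMVSTRA); the exact arguments passed to reweight_accel() in the published analysis are in the .Rmd file.

## add re-calculated survey weights for the accelerometry analysis sample
data_analysis <- reweight_accel(data_analysis)

## survey design object used for all svyglm() fits below
data_analysis_svy_adj <- svydesign(ids     = ~SDMVPSU,
                                   strata  = ~SDMVSTRA,
                                   weights = ~wtmec4yr_adj_norm,
                                   data    = data_analysis,
                                   nest    = TRUE)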

Results

Participant characteristics by mortality status are detailed in Table 5. Predictors in Table 5 are ranked according to the AUC criterion from univariate logistic regression models, where each mortality prediction model was fitted with one covariate at a time. For example, TAC (total activity count) is the most powerful single predictor of 5-year mortality (AUC = 0.771), the next most predictive variable is Age (AUC = 0.758), and so on. We use the cross-validated, survey-weight-adjusted AUC (the auc_ij_adj column of the object auc_mat_1_adj) calculated at the first step of the forward selection process.
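As an illustration of the ranking step, the sketch below fits a single-covariate survey-weighted logistic regression and computes a survey-weight-adjusted AUC; wt_auc() is an illustrative helper (not a function from the packages above), and no cross-validation is shown, so the value will differ slightly from the cross-validated AUCs reported in Table 5.

## survey-weight-adjusted AUC: weighted concordance over all (decedent, survivor) pairs
wt_auc <- function(p, y, w) {
  pos <- which(y == 1); neg <- which(y == 0)
  wpair <- outer(w[pos], w[neg])
  conc  <- outer(p[pos], p[neg], function(a, b) (a > b) + 0.5 * (a == b))
  sum(wpair * conc) / sum(wpair)
}

## example: in-sample adjusted AUC for TAC alone
fit_tac <- svyglm(yr5_mort ~ TAC, design = data_analysis_svy_adj, family = quasibinomial())
wt_auc(p = fitted(fit_tac),
       y = data_analysis$yr5_mort,
       w = data_analysis$wtmec4yr_adj_norm)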

Table 5: Demographic and clinical characteristics ranked by their predictive ability in the 5-year mortality model.

| Rank | Characteristic | Total | Survivors | Decedents | AUC |
|---|---|---|---|---|---|
| 1 | TAC | 209786.1 (112911) | 217926 (111868.8) | 136307.9 (94316.3) | 0.771 |
| 2 | Age | 65.9 (9.6) | 65 (9.3) | 73.4 (8.9) | 0.758 |
| 3 | MVPA | 13.8 (17.1) | 14.7 (17.3) | 6.5 (12.1) | 0.745 |
| 4 | ASTP | 0.3 (0.09) | 0.29 (0.08) | 0.37 (0.11) | 0.733 |
| 5 | Sedentary time | 1110.6 (108.4) | 1102.4 (105.1) | 1184 (110.5) | 0.728 |
| 6 | TLAC | 2758.5 (727.2) | 2811.7 (705.6) | 2278.7 (744.9) | 0.721 |
| 7 | TLAC 12PM-2PM | 402.5 (117.3) | 410.2 (113.9) | 333.4 (125.3) | 0.697 |
| 8 | TLAC 4PM-6PM | 373.8 (114.7) | 381 (112.2) | 309.3 (116.4) | 0.697 |
| 9 | TLAC 2PM-4PM | 390.6 (119.3) | 398.1 (116.7) | 323 (121.5) | 0.693 |
| 10 | TLAC 6PM-8PM | 313.5 (119.3) | 320.5 (118.4) | 250.7 (108.7) | 0.692 |
| 11 | TLAC 10AM-12PM | 403.4 (129.8) | 410.9 (127.4) | 335.3 (132.3) | 0.682 |
| 12 | Mobility problem | 940 (31.6%) | 768 (28.6%) | 172 (57.9%) | 0.672 |
| 13 | SD on PC 6 (surrogate) | 0.69 (0.27) | 0.7 (0.27) | 0.57 (0.25) | 0.661 |
| 14 | SATP | 0.08 (0.02) | 0.08 (0.02) | 0.07 (0.02) | 0.658 |
| 15 | TLAC 8AM-10AM | 338.5 (154) | 344.7 (153.3) | 282.4 (149.3) | 0.629 |
| 16 | Education | | | | 0.612 |
| | Less than high school | 945 (31.7%) | 822 (30.7%) | 123 (41.4%) | |
| | High school | 739 (24.8%) | 659 (24.6%) | 80 (26.9%) | |
| | More than high school | 1294 (43.5%) | 1200 (44.8%) | 94 (31.6%) | |
| 17 | TLAC 8PM-10PM | 204.4 (121.5) | 208.8 (122.6) | 165.4 (104.1) | 0.603 |
| 18 | Drinking Status | | | | 0.602 |
| | Moderate Drinker | 1445 (48.5%) | 1346 (50.2%) | 99 (33.3%) | |
| | Non-Drinker | 1266 (42.5%) | 1106 (41.3%) | 160 (53.9%) | |
| | Heavy Drinker | 181 (6.1%) | 153 (5.7%) | 28 (9.4%) | |
| | Missing alcohol | 86 (2.9%) | 76 (2.8%) | 10 (3.4%) | |
| 19 | TLAC 6AM-8AM | 166.7 (151.2) | 171.1 (153.5) | 127 (121.9) | 0.59 |
| 20 | Smoking Status | | | | 0.586 |
| | Never | 1326 (44.5%) | 1233 (46%) | 93 (31.3%) | |
| | Former | 1147 (38.5%) | 1010 (37.7%) | 137 (46.1%) | |
| | Current | 505 (17%) | 438 (16.3%) | 67 (22.6%) | |
| 21 | CHF | 168 (5.6%) | 119 (4.4%) | 49 (16.5%) | 0.57 |
| 22 | Gender | | | | 0.559 |
| | Male | 1523 (51.1%) | 1331 (49.6%) | 192 (64.6%) | |
| | Female | 1455 (48.9%) | 1350 (50.4%) | 105 (35.4%) | |
| 23 | Diabetes | 518 (17.4%) | 444 (16.6%) | 74 (24.9%) | 0.558 |
| 24 | Cancer | 455 (15.3%) | 382 (14.2%) | 73 (24.6%) | 0.558 |
| 25 | BMI | | | | 0.553 |
| | Normal | 760 (25.5%) | 663 (24.7%) | 97 (32.7%) | |
| | Underweight | 29 (1%) | 22 (0.8%) | 7 (2.4%) | |
| | Overweight | 1150 (38.6%) | 1048 (39.1%) | 102 (34.3%) | |
| | Obese | 1039 (34.9%) | 948 (35.4%) | 91 (30.6%) | |
| 26 | CHD | 244 (8.2%) | 196 (7.3%) | 48 (16.2%) | 0.552 |
| 27 | Stroke | 174 (5.8%) | 132 (4.9%) | 42 (14.1%) | 0.546 |
| 28 | Race | | | | 0.525 |
| | White | 1756 (59%) | 1556 (58%) | 200 (67.3%) | |
| | Mexican American | 538 (18.1%) | 504 (18.8%) | 34 (11.4%) | |
| | Other Hispanic | 56 (1.9%) | 53 (2%) | 3 (1%) | |
| | Black | 534 (17.9%) | 481 (17.9%) | 53 (17.8%) | |
| | Other | 94 (3.2%) | 87 (3.2%) | 7 (2.4%) | |
| 29 | TLAC 12AM-2AM | 25 (60.1) | 25.1 (61.4) | 24.6 (47.7) | 0.511 |
| 30 | TLAC 10PM-12AM | 85.2 (98.7) | 86.3 (100.5) | 75.3 (80.6) | 0.506 |
| 31 | TLAC 4AM-6AM | 39.1 (83.2) | 39.6 (85.1) | 34.1 (64.1) | 0.504 |
| 32 | TLAC 2AM-4AM | 15.7 (52) | 15.4 (52.7) | 18.3 (44.9) | 0.5 |
| 33 | Wear time | 878.5 (138.5) | 877.1 (134.4) | 891.7 (170.8) | 0.454 |

Note: Data are presented as mean (standard deviation) or n (%).

Figure 3 displays the AIC, EPIC, and AUC at each stage of the forward selection procedure when using cross-validated AUC as the selection criterion. The scale for AIC and EPIC is shown on the right y-axis, while the scale for AUC is shown on the left y-axis. Accelerometry predictors are denoted in red font on the horizontal axis. In addition, the stopping point for the procedure is denoted by the large red dot. The blue and green dots correspond to the models with the lowest (best) EPIC and AIC values, respectively.

AUC for the 5-year mortality model

| Variable | Cross-Validated AUC |
|---|---|
| TAC | 0.771 |
| Age | 0.791 |
| Smoking Status | 0.800 |
| CHF | 0.811 |
| Drinking Status | 0.817 |
| ASTP | 0.823 |
| Mobility problem | 0.828 |
| Gender | 0.830 |
| SD on PC 6 (surrogate) | 0.833 |
| Diabetes | 0.836 |
| Education | 0.837 |
| TLAC 12AM-2AM | 0.838 |
| Stroke | 0.838 |

Table 6 presents the estimated final model coefficients with corresponding standard errors and significance values, accounting for the complex survey design of NHANES via the svyglm() function. The number of variables used in the model is selected according to the cross-validated AUC criterion. The "adjusted" weights we use for regression analyses are "wtmec4yr_adj_norm". These weights are calculated using the reweight_accel() function (see ?reweight_accel for help), which re-weights observed participants using age, gender, and ethnicity strata.

## Final model fits using the adjusted survey weights.
fit_final <- svyglm(as.formula(paste("yr5_mort ~", final_formula )),
                        design=data_analysis_svy_adj,
                        family=quasibinomial())
Table 6: Estimated final model coefficients with corresponding standard errors and significance values

| Variable | Estimate | p-value | OR Estimate | OR Lower CL | OR Upper CL |
|---|---|---|---|---|---|
| Intercept | -9.511 | 0.000 | 0.000 | 0.000 | 0.001 |
| TAC | 0.007 | 0.982 | 1.007 | 0.508 | 1.832 |
| Age | 0.083 | 0.000 | 1.087 | 1.063 | 1.112 |
| Former Smoker | 0.332 | 0.176 | 1.394 | 0.835 | 2.345 |
| Current Smoker | 0.797 | 0.002 | 2.219 | 1.412 | 3.478 |
| CHF: yes | 0.777 | 0.013 | 2.175 | 1.177 | 3.930 |
| Non-Drinker | 0.565 | 0.010 | 1.759 | 1.165 | 2.677 |
| Heavy Drinker | 0.963 | 0.018 | 2.620 | 1.148 | 5.673 |
| Missing Alcohol | 0.747 | 0.106 | 2.111 | 0.752 | 5.193 |
| ASTP | 0.382 | 0.016 | 1.465 | 1.078 | 1.993 |
| Mobility Problem | 0.546 | 0.028 | 1.726 | 1.057 | 2.816 |
| Gender: female | -0.648 | 0.007 | 0.523 | 0.332 | 0.817 |
| SD on PC 6 (surrogate) | -0.291 | 0.002 | 0.748 | 0.629 | 0.885 |
| Diabetes: yes | 0.216 | 0.310 | 1.241 | 0.780 | 1.937 |
| High school Education | -0.008 | 0.973 | 0.992 | 0.582 | 1.694 |
| More than high school Education | -0.231 | 0.309 | 0.794 | 0.489 | 1.294 |
| TLAC 12AM-2AM | 0.128 | 0.099 | 1.137 | 0.958 | 1.322 |
| Stroke: yes | 0.193 | 0.505 | 1.213 | 0.636 | 2.227 |

Comparison to the models without PA accelerometry-derived measures

Non-PA model:

# Non-PA predictors: Age, Smoking status, Congestive heart failure (CHF), Drinking status, Mobility problem, Gender, Diabetes, Education, and Stroke
noPAvars <- c("Age", "SmokeCigs", "CHF", "DrinkStatus", "MobilityProblem",
              "Gender", "Diabetes", "EducationAdult", "Stroke")
noPA_formula <- paste0(noPAvars,collapse="+")
fit_svy_noPA <- svyglm(as.formula(paste("yr5_mort ~", noPA_formula  )),design=data_analysis_svy_adj, family=quasibinomial())

AUC for the non-PA model: calculate 10-fold CV AUC for the no-PA model
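A sketch of the 10-fold cross-validation is given below; it reuses the illustrative wt_auc() helper defined above, the fold assignment and seed are arbitrary, and the exact CV scheme used for the published value is in the .Rmd file.

set.seed(1243)                                    ## illustrative seed
K     <- 10
folds <- sample(rep(1:K, length.out = nrow(data_analysis)))

cv_pred <- rep(NA_real_, nrow(data_analysis))
for (k in 1:K) {
  design_k <- subset(data_analysis_svy_adj, folds != k)      ## training-fold design
  fit_k    <- svyglm(as.formula(paste("yr5_mort ~", noPA_formula)),
                     design = design_k, family = quasibinomial())
  cv_pred[folds == k] <- as.numeric(predict(fit_k,
                                            newdata = data_analysis[folds == k, ],
                                            type = "response"))
}

wt_auc(p = cv_pred, y = data_analysis$yr5_mort, w = data_analysis$wtmec4yr_adj_norm)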

The cross-validated AUC for the non-PA model is 0.799.

Comparison with the best forward-selection no-PA model

Model selection results (cross-validated AUC criterion)

Cross-validated AUC of the forward-selection model without physical activity: the optimal 10-variable model achieves 0.803.

AUC for the model without physical activity predictors

| Variable | Cross-Validated AUC |
|---|---|
| Age | 0.740 |
| MobilityProblem | 0.767 |
| SmokeCigs | 0.781 |
| Gender | 0.788 |
| DrinkStatus | 0.795 |
| CHF | 0.799 |
| Diabetes | 0.801 |
| BMI_cat | 0.802 |
| Stroke | 0.803 |
| Cancer | 0.803 |

Subgroups of the population

Five-year mortality status (0 = alive, 1 = deceased) in the two age subgroups, ages 50-70 and ages 70 and above:

##    0    1 
## 1828   98
## 
##   0   1 
## 853 199
Univariate AUC rankings for the full sample and for the two age subgroups:

| All: Variable | All: AUC | Age 50-70: Variable | Age 50-70: AUC | Age 70 and above: Variable | Age 70 and above: AUC |
|---|---|---|---|---|---|
| TAC | 0.771 | TAC | 0.701 | ASTP | 0.711 |
| Age | 0.758 | MVPA | 0.700 | TAC | 0.706 |
| MVPA | 0.745 | MobilityProblem | 0.677 | TLAC_6 | 0.685 |
| ASTP | 0.733 | TLAC | 0.672 | ST | 0.679 |
| ST | 0.728 | ST | 0.661 | Age | 0.676 |
| TLAC | 0.721 | EducationAdult | 0.658 | TLAC_7 | 0.655 |
| TLAC_7 | 0.697 | TLAC_8 | 0.655 | TLAC | 0.654 |
| TLAC_9 | 0.697 | TLAC_10 | 0.654 | MVPA | 0.654 |
| TLAC_8 | 0.693 | TLAC_9 | 0.650 | sPC6 | 0.649 |
| TLAC_10 | 0.692 | TLAC_7 | 0.648 | TLAC_8 | 0.636 |
| TLAC_6 | 0.682 | Age | 0.647 | TLAC_10 | 0.634 |
| MobilityProblem | 0.672 | ASTP | 0.642 | TLAC_9 | 0.633 |
| sPC6 | 0.661 | SATP | 0.636 | TLAC_5 | 0.626 |
| SATP | 0.658 | DrinkStatus | 0.625 | MobilityProblem | 0.616 |
| TLAC_5 | 0.629 | SmokeCigs | 0.624 | SmokeCigs | 0.589 |
| EducationAdult | 0.612 | TLAC_6 | 0.621 | SATP | 0.588 |
| TLAC_11 | 0.603 | sPC6 | 0.620 | TLAC_2 | 0.581 |
| DrinkStatus | 0.602 | TLAC_4 | 0.614 | Gender | 0.581 |
| TLAC_4 | 0.590 | Diabetes | 0.605 | TLAC_3 | 0.569 |
| SmokeCigs | 0.586 | TLAC_11 | 0.593 | TLAC_11 | 0.561 |
| CHF | 0.570 | CHF | 0.580 | CHF | 0.553 |
| Gender | 0.559 | TLAC_5 | 0.577 | TLAC_1 | 0.548 |
| Diabetes | 0.558 | WT | 0.566 | DrinkStatus | 0.546 |
| Cancer | 0.558 | TLAC_1 | 0.558 | BMI_cat | 0.543 |
| BMI_cat | 0.553 | Gender | 0.553 | Stroke | 0.542 |
| CHD | 0.552 | CHD | 0.551 | WT | 0.540 |
| Stroke | 0.546 | Cancer | 0.549 | CHD | 0.533 |
| Race | 0.525 | Stroke | 0.544 | TLAC_4 | 0.525 |
| TLAC_1 | 0.511 | BMI_cat | 0.540 | Diabetes | 0.516 |
| TLAC_12 | 0.506 | TLAC_3 | 0.532 | Cancer | 0.516 |
| TLAC_3 | 0.504 | Race | 0.519 | TLAC_12 | 0.501 |
| TLAC_2 | 0.500 | TLAC_2 | 0.465 | EducationAdult | 0.496 |
| WT | 0.454 | TLAC_12 | 0.401 | Race | 0.493 |
## [1] "TAC+EducationAdult+CHF+TLAC_10+Cancer+CHD+Diabetes+DrinkStatus+TLAC_8+TLAC_4+SmokeCigs+MVPA+MobilityProblem+TLAC_2+WT+BMI_cat+Stroke"
## [1] "ASTP+Age+sPC6+SmokeCigs+Gender+CHF+TLAC_2"

Robustness of the final model results

Our analysis includes 20 variables derived directly from the accelerometry measurements. These variables may be highly correlated with one another and with age. The correlation plot of age and all activity-derived variables is presented below. Age has high negative correlations with TAC and TLAC and a positive correlation with sedentary time. TAC is highly correlated with most activity-derived measures, including MVPA, SATP, sedentary time, TLAC, and TLAC 4PM-6PM, 6PM-8PM, 2PM-4PM, 12PM-2PM, 10AM-12PM, 8AM-10AM, and 6AM-8AM. ASTP, sedentary time, TLAC, and SATP are among the top activity-derived measures that are highly correlated with multiple other variables. The surrogate for the standard deviation on the 6th PC has low correlation with the other activity-derived variables, which explains its contribution to the increase in AUC in the forward selection model beyond the conventional activity measures.
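A sketch of the correlation plot is shown below; the variable list mirrors the summaries derived above (sPC6 denotes the surrogate for the standard deviation on PC 6), and the plotting options used for the published figure may differ.

## correlation matrix of age and the 20 accelerometry-derived variables
corr_vars <- c("Age", "TAC", "TLAC", "WT", "ST", "MVPA", "SATP", "ASTP",
               paste0("TLAC_", 1:12), "sPC6")
corr_mat  <- cor(data_analysis[, corr_vars], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", type = "upper", tl.col = "black", tl.cex = 0.7)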

Due to the high correlation among different activity summaries, many of these variables contain similar information and often replace each other in the final mortality prediction model. Thus, in practice, we suggest examining a smaller subset of activity-derived variables, including: TLAC, sedentary and wear time, MVPA, SATP, ASTP, and the surrogate for the standard deviation of the sixth PC (SD on PC 6).


The 5-year mortality prediction model reported in this analysis used cross-validated AUC as the forward selection criterion. We examined the robustness of our results to using the EPIC (Figure 4) and AIC (Figure 5) criteria to enter a new variable into the model. While there are differences in the order in which the variables enter the model, all 3 criteria select mostly the same top 8 variables for the final model. The notable exception is that neither AIC nor EPIC selects TAC into the model. This is likely due to the fact that age and TAC have similar explanatory power and both EPIC and AIC select age first. Then, once age is included in the model, other accelerometry measures explain more of the variability left unexplained by age than TAC does.

References

  1. Lumley T, Scott A. AIC and BIC for modeling with complex survey data. Journal of Survey Statistics and Methodology. 2015;3(1):1-18. https://doi.org/10.1093/jssam/smu021

  2. Leroux A. rnhanesdata: NHANES Accelerometry Data Pipeline. R package version 1.0. URL: https://github.com/andrew-leroux/rnhanesdata