first_steps

Last updated: 2020-03-21

Checks: 7 0

Knit directory: semb/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200213)

The command set.seed(20200213) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: bf153bc

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/.ipynb_checkpoints/
    Untracked:  analysis/first_steps.Rmd
    Untracked:  code/for_wflow_cmds.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	bf153bc	githubz0r	2020-03-21	uploading and updating scripts
html	dc6711c	githubz0r	2020-02-28	Build site.
Rmd	fbfb766	githubz0r	2020-02-28	upload markdown of lm analysis and comparison of silja pancrea and my pancrea

preparation

Load data and packages. We’ll just both the data set with numCon/numEnds annotated by Silja and the one annotated by me.

library(tidyverse)

── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(irlba)

Loading required package: Matrix


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

library(cowplot)


********************************************************

Note: As of version 1.0.0, cowplot does not change the

  default ggplot2 theme anymore. To recover the previous

  behavior, execute:
  theme_set(theme_cowplot())

********************************************************

pancrea_holes_all <- read.csv('/Users/Deathvoodoo/Documents/semb_int/TableTilLars.csv')
pancrea_holes_altered <- read.csv('/Users/Deathvoodoo/Documents/semb_int/table_2_full.csv')
pancrea_holes_1 <- pancrea_holes_all[names(pancrea_holes_all)!='hole_area_raw'] %>% na.omit
pancrea_holes_2 <- pancrea_holes_altered[names(pancrea_holes_altered)!='hole_area_raw'] %>% na.omit

analysis

First some plots We just quickly visualize dAdt as a function of each predictor.

predictors_1 <- c("numCon", "numEnds", "period", "hole_area_smooth", "eccentricity", "solidity")
predictors_2 <- c("numCon", "numEnds", "period", "hole_area_smooth", "eccentricity", "solidity") # same as predictors 1 but originally meant to include time
#predictors_2 <- c("numCon", "numEnds", "period", "hole_area_smooth", "eccentricity", "solidity", "time_point")
pancrea_1_norm <- pancrea_holes_1 %>% lapply(function(x){if (x[1] %>% is.numeric()) {x/sd(x)} else{x}}) %>% as.data.frame()
pancrea_2_norm <- pancrea_holes_2 %>% lapply(function(x){if (x[1] %>% is.numeric()) {x/sd(x)} else{x}}) %>% as.data.frame()

pancrea_1_norm_long <- pancrea_1_norm %>% pivot_longer(cols=predictors_1, names_to='predictors')
pancrea_2_norm_long <- pancrea_2_norm %>% pivot_longer(cols=predictors_2, names_to='predictors')

pancrea_1_norm_long %>% ggplot(aes(x=value, y=dAdt, col=predictors))+geom_point(size=0.2, alpha=0.5)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

pancrea_2_norm_long %>% ggplot(aes(x=value, y=dAdt, col=predictors))+geom_point(size=0.2, alpha=0.5)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

Now we do some PCA plots, with and without scaling the variances. We color both by hole and dataset to get an overview of how our points are lying together.

DoPCA <- function(data.mat, predictor.names, make.plots=T, get.percent.variance=T, scale=T){
  panc.pca <- prcomp(data.mat[predictor.names], scale.=scale)
  if(make.plots){
    variances <- panc.pca$sdev^2
    PC.strings <- paste0('PC', '', 1:length(predictor.names))
    col.names <- paste0(PC.strings, ", % variance: ", 100*round(variances/sum(variances), digits = 4))
    panc.pca.df <- setNames(panc.pca$x %>% as.data.frame, col.names) %>% bind_cols(data.mat[setdiff(names(data.mat), predictor.names)])
    phole.scaled <- panc.pca.df %>% ggplot(aes(x=!!as.symbol(col.names[1]), y=!!as.symbol(col.names[2]), col=as.factor(hole_num_rep),
                                               shape=data_set_name))+geom_point(alpha=0.5)+guides(col = guide_legend(ncol = 3))
    pdataset.scaled <- panc.pca.df %>% ggplot(aes(x=!!as.symbol(col.names[1]), y=!!as.symbol(col.names[2]), col=data_set_name))+
      geom_point(alpha=0.5)+guides(col = guide_legend(ncol = 3))
  }
  return(list(panc.pca=panc.pca, phole.scaled=phole.scaled, pdataset.scaled=pdataset.scaled))
}

pancrea_1_pca <- DoPCA(pancrea_holes_1, predictors_1)
pancrea_2_pca <- DoPCA(pancrea_holes_2, predictors_2)

pancrea_1_pca$phole.scaled

Version	Author	Date
dc6711c	githubz0r	2020-02-28

pancrea_1_pca$pdataset.scaled

Version	Author	Date
dc6711c	githubz0r	2020-02-28

pancrea_2_pca$phole.scaled

Version	Author	Date
dc6711c	githubz0r	2020-02-28

pancrea_2_pca$pdataset.scaled

Version	Author	Date
dc6711c	githubz0r	2020-02-28

ggsave('/Users/Deathvoodoo/Documents/semb_int/pca_hole.png', pancrea_1_pca$phole.scaled)

Saving 7 x 5 in image

ggsave('/Users/Deathvoodoo/Documents/semb_int/pca_dataset.png', pancrea_1_pca$pdataset.scaled)

Saving 7 x 5 in image

Check weights of the features in the rotation matrices

predictors_2 %>% setNames(1:length(predictors_2), .)

          numCon          numEnds           period hole_area_smooth 
               1                2                3                4 
    eccentricity         solidity 
               5                6

pancrea_1_pca$panc.pca$rotation

                        PC1         PC2         PC3         PC4
numCon           -0.4144738  0.49053188 -0.08067353  0.08576020
numEnds           0.1312448  0.64677488 -0.23746966 -0.62105743
period           -0.3038964  0.02637844 -0.79051662  0.42621626
hole_area_smooth -0.5044041  0.27287369  0.38812584  0.06715059
eccentricity     -0.3329065 -0.48612903 -0.31201869 -0.63853288
solidity          0.5944602  0.17199831 -0.25334974  0.11418898
                         PC5         PC6
numCon           -0.74209386 -0.15173534
numEnds           0.26054476  0.23338661
period            0.30440788  0.08780756
hole_area_smooth  0.52737600 -0.48768855
eccentricity     -0.09934392 -0.37135254
solidity         -0.02746487 -0.73420302

pancrea_2_pca$panc.pca$rotation

                        PC1        PC2         PC3         PC4         PC5
numCon           -0.5228088 -0.2554301 -0.07125109  0.01142662 -0.80988975
numEnds          -0.2345789 -0.6083004 -0.23217188  0.54044425  0.37759193
period           -0.2065639  0.3559548 -0.88516007 -0.17855650  0.09803131
hole_area_smooth -0.5514703 -0.1930589  0.11248327 -0.34771580  0.38905325
eccentricity     -0.2609432  0.5384805  0.10287140  0.72223502 -0.00730071
solidity          0.5067190 -0.3328503 -0.36643709  0.18269460 -0.20118911
                         PC6
numCon           -0.01725596
numEnds          -0.29391147
period           -0.07501102
hole_area_smooth  0.61129987
eccentricity      0.33119196
solidity          0.65139541

We check distribution of dAdt visually and with shapiro wilk test. Of course, for fitting a linear model (with gaussian noise) we assume that the data is normally distributed, so it better be.

hist(pancrea_holes_1$dAdt, breaks = 50, prob=T)
lines(density(pancrea_holes_1$dAdt), col = "blue")

Version	Author	Date
dc6711c	githubz0r	2020-02-28

shapiro.test(pancrea_holes_1$dAdt)


    Shapiro-Wilk normality test

data:  pancrea_holes_1$dAdt
W = 0.9189, p-value < 2.2e-16

Check correlation between variables, both as scatter plot and heat map (get different color scheme with two colors)

MakeCorrelationPlots <- function(data.matrix){
  corr.mat <- cor(data.matrix)
  corr.values <- corr.mat; corr.values[upper.tri(corr.values, diag=T)] <- NA
  corr.values <- corr.values %>% reshape2::melt() %>% na.omit %>% transmute(which_factors=paste0(Var1,"+", Var2), correlation=value)
  pcor.scatter <- corr.values %>% ggplot(aes(x=which_factors, y=correlation))+geom_point()+theme(axis.text.x = element_text(angle = 90, hjust = 1))
  pcor.heat <- corr.mat %>% reshape2::melt() %>% ggplot(aes(x=Var1, y=Var2, fill=value))+geom_tile()+geom_text(aes(label = round(value, 2)))+
    scale_fill_gradient2(low = 'blue', high = 'red')+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  return(list(pcor.scatter=pcor.scatter, pcor.heat=pcor.heat))
}
corplot_1 <- MakeCorrelationPlots(pancrea_holes_1[c('dAdt', predictors_1)])
corplot_2 <- MakeCorrelationPlots(pancrea_holes_2[c('dAdt', predictors_2)])

corplot_1$pcor.scatter

Version	Author	Date
dc6711c	githubz0r	2020-02-28

corplot_1$pcor.heat

Version	Author	Date
dc6711c	githubz0r	2020-02-28

corplot_2$pcor.scatter

Version	Author	Date
dc6711c	githubz0r	2020-02-28

corplot_2$pcor.heat

Version	Author	Date
dc6711c	githubz0r	2020-02-28

#ggsave('/Users/Deathvoodoo/Documents/semb_int/correlation_matrix.png', corplot_1$pcor.heat)

How did the area of the holes change over time again (using my annotated time points) ?

pancrea_area_df <- pancrea_holes_2[c('data_set_name', 'hole_num_rep', 'hole_num_inmovie', 'hole_area_smooth', 'time_point')]
area_time_1 <- pancrea_area_df %>% filter(data_set_name=='pos1') %>% ggplot(aes(x=time_point, y=hole_area_smooth, col=as.factor(hole_num_inmovie)))+geom_point(size=0.2, alpha=0.5)
area_time_2 <-pancrea_area_df %>% filter(data_set_name=='pos2') %>% ggplot(aes(x=time_point, y=hole_area_smooth, col=as.factor(hole_num_inmovie)))+geom_point(size=0.2, alpha=0.5)
area_time_3 <-pancrea_area_df %>% ggplot(aes(x=time_point, y=hole_area_smooth, col=as.factor(hole_num_inmovie)))+geom_point(size=0.2, alpha=0.5) + 
  facet_wrap(~data_set_name, scale='free') + guides(col = guide_legend(ncol = 2))
area_time_3

#ggsave('/Users/Deathvoodoo/Documents/semb_int/area_time1.png', area_time_1)
#ggsave('/Users/Deathvoodoo/Documents/semb_int/area_time2.png', area_time_2)
#ggsave('/Users/Deathvoodoo/Documents/semb_int/area_time.png', area_time_3, width = 10)

Also we can check the distributions of each predictor variable.

pancrea_holes_1[predictors_1] %>% pivot_longer(predictors_1, names_to = 'predictors') %>% ggplot(aes(x=value))+geom_histogram()+facet_wrap(~predictors, scale='free')

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We may have some reservations about period, let’s try to check the average period of each hole as a function of the number of data points of the hole (as a substitute for life-time)

mean_period <- pancrea_holes_1 %>% group_by(hole_num_rep) %>% summarize(period_mean=mean(period), nrow_holes=as.numeric(table(hole_num_rep)))
mean_period %>% ggplot(aes(x=as.factor(nrow_holes), y=period_mean, col=as.factor(hole_num_rep)))+geom_point()+theme(axis.text.x = element_text(angle = -90, hjust = 1))

Version	Author	Date
dc6711c	githubz0r	2020-02-28

#mean_dadt <- pancrea_holes_1 %>% group_by(hole_num_rep) %>% summarize(dadt_mean=mean(dAdt), nrow_holes=as.numeric(table(hole_num_rep)), area_mean=mean(hole_area_smooth))
#mean_dadt %>% ggplot(aes(x=area_mean, y=dadt_mean, col=as.factor(hole_num_rep)))+geom_point()+theme(axis.text.x = element_text(angle = -90, hjust = 1))

Now we start fitting some models. We first try all the variables. Note that we are omitting dEdt because Silja said it’s not important and also numEnds_over_numCon since it’s very highly correlation with numEnds

cor(pancrea_holes_1$numEnds, pancrea_holes_1$numEnds_over_numCon)

[1] 0.9663755

The third model will be including the time, i.e. the time in the video annotated by me.

linmodel_1 <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = pancrea_holes_1)
linmodel_2 <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = pancrea_holes_2)
linmodel_time <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity + time_point, data = pancrea_holes_2)

Diagnostic plots of the models. Things definitely do not look iid/evenly distribution noise/homoskedastic

plot(linmodel_1)

plot(linmodel_2)

plot(linmodel_time)

We print some summaries of the models. This gives some various statistical info, such as e.g. weight estimates, p values when doing some kind of t-test of the weights (with null hypothesis that they are equal to zero). This gives similar values to doing a likelihood ratio test where the nested model lacks that coefficient. Note that the models are all complete trash, judging by the R^2 values.

summary(linmodel_1)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = pancrea_holes_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-248.653  -18.581    2.319   19.959  182.195 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -6.555e+01  2.537e+01  -2.584 0.009820 ** 
numCon           -6.016e+00  7.448e-01  -8.077 9.48e-16 ***
numEnds          -6.282e+00  8.174e-01  -7.685 2.04e-14 ***
period            2.127e-01  1.750e-01   1.215 0.224327    
hole_area_smooth  4.393e-03  4.287e-04  10.247  < 2e-16 ***
eccentricity     -2.789e+01  8.026e+00  -3.475 0.000517 ***
solidity          9.887e+01  2.330e+01   4.244 2.26e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.51 on 3063 degrees of freedom
Multiple R-squared:  0.07249,   Adjusted R-squared:  0.07067 
F-statistic:  39.9 on 6 and 3063 DF,  p-value: < 2.2e-16

summary(linmodel_2)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = pancrea_holes_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-251.242  -20.044    0.096   21.088  186.804 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -4.368e+01  2.592e+01  -1.685  0.09201 .  
numCon           -2.650e+00  8.188e-01  -3.237  0.00122 ** 
numEnds           1.523e+00  7.032e-01   2.166  0.03041 *  
period           -5.143e-02  1.762e-01  -0.292  0.77033    
hole_area_smooth  3.141e-03  5.047e-04   6.223 5.55e-10 ***
eccentricity     -1.873e+01  8.219e+00  -2.279  0.02276 *  
solidity          5.449e+01  2.391e+01   2.279  0.02275 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.53 on 3063 degrees of freedom
Multiple R-squared:  0.02545,   Adjusted R-squared:  0.02355 
F-statistic: 13.33 on 6 and 3063 DF,  p-value: 5.671e-15

summary(linmodel_time)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity + time_point, data = pancrea_holes_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-251.898  -19.883    0.044   21.039  186.622 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -4.396e+01  2.592e+01  -1.696  0.09002 .  
numCon           -2.357e+00  8.871e-01  -2.657  0.00793 ** 
numEnds           1.572e+00  7.056e-01   2.228  0.02593 *  
period           -3.726e-02  1.769e-01  -0.211  0.83323    
hole_area_smooth  3.041e-03  5.178e-04   5.873 4.74e-09 ***
eccentricity     -1.799e+01  8.264e+00  -2.177  0.02955 *  
solidity          5.372e+01  2.393e+01   2.245  0.02485 *  
time_point       -1.194e-02  1.389e-02  -0.860  0.39011    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.53 on 3062 degrees of freedom
Multiple R-squared:  0.02569,   Adjusted R-squared:  0.02346 
F-statistic: 11.53 on 7 and 3062 DF,  p-value: 1.546e-14

Just testing linmodel_time vs the nested model linmodel_2 (and we can also do some aic). We see that time basically sucks.

anova(linmodel_2, linmodel_time)

Analysis of Variance Table

Model 1: dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity
Model 2: dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity + time_point
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1   3063 5282761                           
2   3062 5281486  1    1274.3 0.7388 0.3901

AIC(linmodel_2) - AIC(linmodel_time) # should be positive if linmodel 3 is better

[1] -1.259347

Is our model even better than just random shit? We can make some random data frames and see.

RandomizeDataframe <- function(df){
  df.random <- df %>% lapply(function(x){sample(x)}) %>% as.data.frame
  return(df.random)
}

random_pancreas <- replicate(100, RandomizeDataframe(pancrea_holes_1), simplify = FALSE)

Let’s test the random data frames

random_results <- random_pancreas %>% lapply(function(x){
  linmodel.random <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = x)
  random.lm.summary <- summary(linmodel.random)
  residual.standard.error <- random.lm.summary$sigma
  r.squared <- random.lm.summary$r.squared
  r.squared.adjusted <- random.lm.summary$adj.r.squared
  sum.squared.resids <- sum(residuals(linmodel.random)^2)
  return(c(residual.standard.error=residual.standard.error, r.squared=r.squared, 
           r.squared.adjusted=r.squared.adjusted, sum.squared.resids=sum.squared.resids))
})

output_values_random <-random_results[[1]] %>% names %>% lapply(function(x){
  value.for.all.res <- random_results %>% lapply(function(result.vector){result.vector[[x]] %>% unname})
  return(value.for.all.res)
}) %>% lapply(unlist); output_values_random %<>% setNames(random_results[[1]] %>% names)

output_values_random$residual.standard.error %>% qplot

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output_values_random$r.squared %>% qplot

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output_values_random$r.squared.adjusted %>% qplot

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output_values_random$sum.squared.resids %>% qplot

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We compare mean residual standard error, mean r squared, mean r squared adjusted and the mean sum of squared residuals from randomized data frames with the values for the first linear model. It’s a little bit better.

output_values_random %>% lapply(mean) %>% unlist

residual.standard.error               r.squared      r.squared.adjusted 
           4.202395e+01            2.111187e-03            1.564589e-04 
     sum.squared.resids 
           5.409299e+06

summary(linmodel_1)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = pancrea_holes_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-248.653  -18.581    2.319   19.959  182.195 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -6.555e+01  2.537e+01  -2.584 0.009820 ** 
numCon           -6.016e+00  7.448e-01  -8.077 9.48e-16 ***
numEnds          -6.282e+00  8.174e-01  -7.685 2.04e-14 ***
period            2.127e-01  1.750e-01   1.215 0.224327    
hole_area_smooth  4.393e-03  4.287e-04  10.247  < 2e-16 ***
eccentricity     -2.789e+01  8.026e+00  -3.475 0.000517 ***
solidity          9.887e+01  2.330e+01   4.244 2.26e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.51 on 3063 degrees of freedom
Multiple R-squared:  0.07249,   Adjusted R-squared:  0.07067 
F-statistic:  39.9 on 6 and 3063 DF,  p-value: < 2.2e-16

sum(residuals(linmodel_1)^2)

[1] 5027797

Now we can do some stepwise linear regression with different R packages, using AIC/BIC as criteria.

library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

library(leaps)

First using MASS

step_mass_bi <- stepAIC(linmodel_1, direction='both')

Start:  AIC=22735.25
dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity

                   Df Sum of Sq     RSS   AIC
- period            1      2425 5030221 22735
<none>                          5027797 22735
- eccentricity      1     19826 5047623 22745
- solidity          1     29564 5057360 22751
- numEnds           1     96948 5124744 22792
- numCon            1    107075 5134871 22798
- hole_area_smooth  1    172367 5200163 22837

Step:  AIC=22734.73
dAdt ~ numCon + numEnds + hole_area_smooth + eccentricity + solidity

                   Df Sum of Sq     RSS   AIC
<none>                          5030221 22735
+ period            1      2425 5027797 22735
- eccentricity      1     17873 5048094 22744
- solidity          1     30451 5060672 22751
- numEnds           1     98933 5129155 22792
- numCon            1    105419 5135640 22796
- hole_area_smooth  1    174482 5204703 22837

summary(step_mass_bi)


Call:
lm(formula = dAdt ~ numCon + numEnds + hole_area_smooth + eccentricity + 
    solidity, data = pancrea_holes_1)

Residuals:
    Min      1Q  Median      3Q     Max 
-248.81  -18.10    1.92   19.81  183.92 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -6.441e+01  2.535e+01  -2.540  0.01112 *  
numCon           -5.805e+00  7.244e-01  -8.013 1.57e-15 ***
numEnds          -6.337e+00  8.163e-01  -7.763 1.12e-14 ***
hole_area_smooth  4.416e-03  4.284e-04  10.309  < 2e-16 ***
eccentricity     -2.596e+01  7.867e+00  -3.299  0.00098 ***
solidity          1.002e+02  2.327e+01   4.307 1.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.52 on 3064 degrees of freedom
Multiple R-squared:  0.07204,   Adjusted R-squared:  0.07053 
F-statistic: 47.57 on 5 and 3064 DF,  p-value: < 2.2e-16

We can try putting in a model with second order interactions and see what happens.

step_mass_squared_bi <- stepAIC(lm(dAdt~.^2, data = pancrea_holes_1[c('dAdt', predictors_1)]), direction='both')

Start:  AIC=22280.89
dAdt ~ (numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity)^2

                                Df Sum of Sq     RSS   AIC
- numEnds:hole_area_smooth       1       405 4294365 22279
- numEnds:solidity               1       569 4294529 22279
- period:hole_area_smooth        1      1908 4295868 22280
<none>                                       4293960 22281
- numEnds:period                 1      3012 4296972 22281
- numEnds:eccentricity           1      4029 4297989 22282
- numCon:period                  1     15649 4309609 22290
- period:solidity                1     16108 4310068 22290
- period:eccentricity            1     17205 4311164 22291
- numCon:eccentricity            1     18125 4312085 22292
- eccentricity:solidity          1     22362 4316321 22295
- numCon:solidity                1     63673 4357633 22324
- numCon:numEnds                 1    102497 4396456 22351
- numCon:hole_area_smooth        1    109613 4403572 22356
- hole_area_smooth:eccentricity  1    184435 4478395 22408
- hole_area_smooth:solidity      1    327369 4621328 22504

Step:  AIC=22279.18
dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity + numCon:numEnds + numCon:period + numCon:hole_area_smooth + 
    numCon:eccentricity + numCon:solidity + numEnds:period + 
    numEnds:eccentricity + numEnds:solidity + period:hole_area_smooth + 
    period:eccentricity + period:solidity + hole_area_smooth:eccentricity + 
    hole_area_smooth:solidity + eccentricity:solidity

                                Df Sum of Sq     RSS   AIC
- numEnds:solidity               1       178 4294543 22277
- period:hole_area_smooth        1      2131 4296496 22279
<none>                                       4294365 22279
- numEnds:period                 1      3436 4297801 22280
- numEnds:eccentricity           1      3643 4298008 22280
+ numEnds:hole_area_smooth       1       405 4293960 22281
- numCon:period                  1     15244 4309609 22288
- period:solidity                1     16700 4311065 22289
- period:eccentricity            1     17304 4311669 22290
- numCon:eccentricity            1     18882 4313247 22291
- eccentricity:solidity          1     22077 4316442 22293
- numCon:solidity                1     63271 4357636 22322
- numCon:numEnds                 1    106057 4400422 22352
- numCon:hole_area_smooth        1    114792 4409157 22358
- hole_area_smooth:eccentricity  1    184369 4478734 22406
- hole_area_smooth:solidity      1    334495 4628859 22508

Step:  AIC=22277.31
dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity + numCon:numEnds + numCon:period + numCon:hole_area_smooth + 
    numCon:eccentricity + numCon:solidity + numEnds:period + 
    numEnds:eccentricity + period:hole_area_smooth + period:eccentricity + 
    period:solidity + hole_area_smooth:eccentricity + hole_area_smooth:solidity + 
    eccentricity:solidity

                                Df Sum of Sq     RSS   AIC
- period:hole_area_smooth        1      2258 4296801 22277
<none>                                       4294543 22277
- numEnds:eccentricity           1      3523 4298067 22278
- numEnds:period                 1      5087 4299630 22279
+ numEnds:solidity               1       178 4294365 22279
+ numEnds:hole_area_smooth       1        14 4294529 22279
- numCon:period                  1     15158 4309701 22286
- period:solidity                1     17073 4311616 22288
- period:eccentricity            1     17716 4312259 22288
- numCon:eccentricity            1     19816 4314359 22289
- eccentricity:solidity          1     22158 4316701 22291
- numCon:solidity                1     70592 4365136 22325
- numCon:hole_area_smooth        1    115390 4409933 22357
- numCon:numEnds                 1    120313 4414856 22360
- hole_area_smooth:eccentricity  1    185186 4479729 22405
- hole_area_smooth:solidity      1    348112 4642656 22515

Step:  AIC=22276.92
dAdt ~ numCon + numEnds + period + hole_area_smooth + eccentricity + 
    solidity + numCon:numEnds + numCon:period + numCon:hole_area_smooth + 
    numCon:eccentricity + numCon:solidity + numEnds:period + 
    numEnds:eccentricity + period:eccentricity + period:solidity + 
    hole_area_smooth:eccentricity + hole_area_smooth:solidity + 
    eccentricity:solidity

                                Df Sum of Sq     RSS   AIC
<none>                                       4296801 22277
- numEnds:eccentricity           1      2898 4299699 22277
+ period:hole_area_smooth        1      2258 4294543 22277
- numEnds:period                 1      4599 4301400 22278
+ numEnds:solidity               1       306 4296496 22279
+ numEnds:hole_area_smooth       1        18 4296784 22279
- period:solidity                1     15117 4311918 22286
- numCon:period                  1     15498 4312299 22286
- period:eccentricity            1     15779 4312580 22286
- eccentricity:solidity          1     21183 4317984 22290
- numCon:eccentricity            1     22487 4319288 22291
- numCon:solidity                1     68903 4365704 22324
- numCon:numEnds                 1    119991 4416792 22360
- numCon:hole_area_smooth        1    146145 4442946 22378
- hole_area_smooth:eccentricity  1    183921 4480722 22404
- hole_area_smooth:solidity      1    347521 4644322 22514

summary(step_mass_squared_bi)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity + numCon:numEnds + numCon:period + 
    numCon:hole_area_smooth + numCon:eccentricity + numCon:solidity + 
    numEnds:period + numEnds:eccentricity + period:eccentricity + 
    period:solidity + hole_area_smooth:eccentricity + hole_area_smooth:solidity + 
    eccentricity:solidity, data = pancrea_holes_1[c("dAdt", predictors_1)])

Residuals:
    Min      1Q  Median      3Q     Max 
-209.37  -18.57    2.34   18.93  175.63 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.373e+02  2.107e+02   0.652 0.514734    
numCon                        -1.495e+02  3.063e+01  -4.882 1.10e-06 ***
numEnds                        3.383e+01  9.868e+00   3.428 0.000616 ***
period                         1.785e+01  6.351e+00   2.810 0.004989 ** 
hole_area_smooth               2.306e-01  1.363e-02  16.918  < 2e-16 ***
eccentricity                  -3.109e+02  1.857e+02  -1.674 0.094143 .  
solidity                      -5.048e+02  2.140e+02  -2.359 0.018372 *  
numCon:numEnds                -1.088e+01  1.179e+00  -9.230  < 2e-16 ***
numCon:period                  1.091e+00  3.289e-01   3.317 0.000920 ***
numCon:hole_area_smooth       -4.652e-03  4.566e-04 -10.187  < 2e-16 ***
numCon:eccentricity           -4.019e+01  1.006e+01  -3.996 6.60e-05 ***
numCon:solidity                1.952e+02  2.791e+01   6.995 3.25e-12 ***
numEnds:period                 4.589e-01  2.539e-01   1.807 0.070844 .  
numEnds:eccentricity          -1.486e+01  1.036e+01  -1.434 0.151553    
period:eccentricity           -5.812e+00  1.736e+00  -3.347 0.000826 ***
period:solidity               -1.880e+01  5.739e+00  -3.276 0.001064 ** 
hole_area_smooth:eccentricity -6.308e-02  5.520e-03 -11.428  < 2e-16 ***
hole_area_smooth:solidity     -1.796e-01  1.143e-02 -15.709  < 2e-16 ***
eccentricity:solidity          7.934e+02  2.046e+02   3.878 0.000107 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.53 on 3051 degrees of freedom
Multiple R-squared:  0.2073,    Adjusted R-squared:  0.2027 
F-statistic: 44.34 on 18 and 3051 DF,  p-value: < 2.2e-16

Gives better adjusted R squared fo sho.

Using leaps package

models_leap <- regsubsets(dAdt~., data = pancrea_holes_1[c('dAdt', predictors_1)], nvmax = 7,
                     method = "seqrep")

summary_leap <- summary(models_leap)
summary_leap

Subset selection object
Call: regsubsets.formula(dAdt ~ ., data = pancrea_holes_1[c("dAdt", 
    predictors_1)], nvmax = 7, method = "seqrep")
6 Variables  (and intercept)
                 Forced in Forced out
numCon               FALSE      FALSE
numEnds              FALSE      FALSE
period               FALSE      FALSE
hole_area_smooth     FALSE      FALSE
eccentricity         FALSE      FALSE
solidity             FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: 'sequential replacement'
         numCon numEnds period hole_area_smooth eccentricity solidity
1  ( 1 ) " "    "*"     " "    " "              " "          " "     
2  ( 1 ) "*"    " "     " "    "*"              " "          " "     
3  ( 1 ) "*"    "*"     "*"    " "              " "          " "     
4  ( 1 ) "*"    "*"     " "    "*"              " "          "*"     
5  ( 1 ) "*"    "*"     " "    "*"              "*"          "*"     
6  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"

summary_leap$bic

[1]  -36.07497 -117.79142  -56.77597 -178.50501 -181.36415 -174.81480

Doing both forwards and backwards just because.

models_forward <- regsubsets(dAdt~., data = pancrea_holes_1[c('dAdt', predictors_1)], nvmax = 7,
                     method = "forward")

summary_forward <- summary(models_forward)
summary_forward

Subset selection object
Call: regsubsets.formula(dAdt ~ ., data = pancrea_holes_1[c("dAdt", 
    predictors_1)], nvmax = 7, method = "forward")
6 Variables  (and intercept)
                 Forced in Forced out
numCon               FALSE      FALSE
numEnds              FALSE      FALSE
period               FALSE      FALSE
hole_area_smooth     FALSE      FALSE
eccentricity         FALSE      FALSE
solidity             FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: forward
         numCon numEnds period hole_area_smooth eccentricity solidity
1  ( 1 ) " "    "*"     " "    " "              " "          " "     
2  ( 1 ) "*"    "*"     " "    " "              " "          " "     
3  ( 1 ) "*"    "*"     " "    "*"              " "          " "     
4  ( 1 ) "*"    "*"     " "    "*"              " "          "*"     
5  ( 1 ) "*"    "*"     " "    "*"              "*"          "*"     
6  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"

summary_forward$bic

[1]  -36.07497  -64.65230 -141.25660 -178.50501 -181.36415 -174.81480

models_backward <- regsubsets(dAdt~., data = pancrea_holes_1[c('dAdt', predictors_1)], nvmax = 7,
                     method = "backward")

summary_backward <- summary(models_backward)
summary_backward

Subset selection object
Call: regsubsets.formula(dAdt ~ ., data = pancrea_holes_1[c("dAdt", 
    predictors_1)], nvmax = 7, method = "backward")
6 Variables  (and intercept)
                 Forced in Forced out
numCon               FALSE      FALSE
numEnds              FALSE      FALSE
period               FALSE      FALSE
hole_area_smooth     FALSE      FALSE
eccentricity         FALSE      FALSE
solidity             FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         numCon numEnds period hole_area_smooth eccentricity solidity
1  ( 1 ) "*"    " "     " "    " "              " "          " "     
2  ( 1 ) "*"    " "     " "    "*"              " "          " "     
3  ( 1 ) "*"    "*"     " "    "*"              " "          " "     
4  ( 1 ) "*"    "*"     " "    "*"              " "          "*"     
5  ( 1 ) "*"    "*"     " "    "*"              "*"          "*"     
6  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"

summary_backward$bic

[1]  -35.76123 -117.79142 -141.25660 -178.50501 -181.36415 -174.81480

We can try higher order model, see wut happens.

Nsquaredpermutations <- function(n.features){ # since the regsubsets function has a max number of variables, we just put in the maximum number of possible variables of 2nd order
  return((n.features^2-n.features)/2 + n.features + 1)
}
models_leap_square <- regsubsets(dAdt~.^2, data = pancrea_holes_1[c('dAdt', predictors_1)], nvmax = Nsquaredpermutations(length(predictors_1)),
                     method = "seqrep")

summary_leap_square <- summary(models_leap_square)
summary_leap_square

Subset selection object
Call: regsubsets.formula(dAdt ~ .^2, data = pancrea_holes_1[c("dAdt", 
    predictors_1)], nvmax = Nsquaredpermutations(length(predictors_1)), 
    method = "seqrep")
21 Variables  (and intercept)
                              Forced in Forced out
numCon                            FALSE      FALSE
numEnds                           FALSE      FALSE
period                            FALSE      FALSE
hole_area_smooth                  FALSE      FALSE
eccentricity                      FALSE      FALSE
solidity                          FALSE      FALSE
numCon:numEnds                    FALSE      FALSE
numCon:period                     FALSE      FALSE
numCon:hole_area_smooth           FALSE      FALSE
numCon:eccentricity               FALSE      FALSE
numCon:solidity                   FALSE      FALSE
numEnds:period                    FALSE      FALSE
numEnds:hole_area_smooth          FALSE      FALSE
numEnds:eccentricity              FALSE      FALSE
numEnds:solidity                  FALSE      FALSE
period:hole_area_smooth           FALSE      FALSE
period:eccentricity               FALSE      FALSE
period:solidity                   FALSE      FALSE
hole_area_smooth:eccentricity     FALSE      FALSE
hole_area_smooth:solidity         FALSE      FALSE
eccentricity:solidity             FALSE      FALSE
1 subsets of each size up to 21
Selection Algorithm: 'sequential replacement'
          numCon numEnds period hole_area_smooth eccentricity solidity
1  ( 1 )  " "    " "     " "    " "              " "          " "     
2  ( 1 )  " "    " "     " "    "*"              " "          " "     
3  ( 1 )  " "    " "     " "    "*"              " "          " "     
4  ( 1 )  " "    " "     " "    "*"              " "          " "     
5  ( 1 )  " "    " "     " "    "*"              " "          " "     
6  ( 1 )  " "    " "     " "    "*"              " "          " "     
7  ( 1 )  " "    " "     " "    "*"              "*"          "*"     
8  ( 1 )  "*"    " "     " "    "*"              "*"          " "     
9  ( 1 )  "*"    " "     " "    "*"              "*"          " "     
10  ( 1 ) " "    " "     " "    "*"              " "          " "     
11  ( 1 ) "*"    "*"     " "    "*"              "*"          " "     
12  ( 1 ) "*"    "*"     " "    "*"              "*"          " "     
13  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
14  ( 1 ) "*"    "*"     " "    "*"              " "          "*"     
15  ( 1 ) "*"    "*"     "*"    "*"              " "          " "     
16  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
17  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
18  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
19  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
20  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
21  ( 1 ) "*"    "*"     "*"    "*"              "*"          "*"     
          numCon:numEnds numCon:period numCon:hole_area_smooth
1  ( 1 )  " "            " "           " "                    
2  ( 1 )  " "            " "           "*"                    
3  ( 1 )  " "            " "           "*"                    
4  ( 1 )  "*"            " "           "*"                    
5  ( 1 )  " "            " "           "*"                    
6  ( 1 )  " "            " "           "*"                    
7  ( 1 )  "*"            " "           "*"                    
8  ( 1 )  "*"            " "           "*"                    
9  ( 1 )  "*"            " "           "*"                    
10  ( 1 ) "*"            " "           "*"                    
11  ( 1 ) "*"            " "           "*"                    
12  ( 1 ) "*"            "*"           "*"                    
13  ( 1 ) "*"            "*"           "*"                    
14  ( 1 ) "*"            "*"           "*"                    
15  ( 1 ) "*"            "*"           "*"                    
16  ( 1 ) "*"            "*"           "*"                    
17  ( 1 ) "*"            "*"           "*"                    
18  ( 1 ) "*"            "*"           "*"                    
19  ( 1 ) "*"            "*"           "*"                    
20  ( 1 ) "*"            "*"           "*"                    
21  ( 1 ) "*"            "*"           "*"                    
          numCon:eccentricity numCon:solidity numEnds:period
1  ( 1 )  "*"                 " "             " "           
2  ( 1 )  " "                 " "             " "           
3  ( 1 )  " "                 " "             " "           
4  ( 1 )  " "                 " "             " "           
5  ( 1 )  " "                 " "             " "           
6  ( 1 )  " "                 " "             " "           
7  ( 1 )  " "                 " "             " "           
8  ( 1 )  " "                 "*"             " "           
9  ( 1 )  " "                 "*"             "*"           
10  ( 1 ) "*"                 "*"             "*"           
11  ( 1 ) "*"                 "*"             "*"           
12  ( 1 ) "*"                 "*"             " "           
13  ( 1 ) "*"                 "*"             "*"           
14  ( 1 ) "*"                 "*"             " "           
15  ( 1 ) "*"                 "*"             "*"           
16  ( 1 ) "*"                 "*"             "*"           
17  ( 1 ) "*"                 "*"             "*"           
18  ( 1 ) "*"                 "*"             "*"           
19  ( 1 ) "*"                 "*"             "*"           
20  ( 1 ) "*"                 "*"             "*"           
21  ( 1 ) "*"                 "*"             "*"           
          numEnds:hole_area_smooth numEnds:eccentricity numEnds:solidity
1  ( 1 )  " "                      " "                  " "             
2  ( 1 )  " "                      " "                  " "             
3  ( 1 )  " "                      "*"                  " "             
4  ( 1 )  " "                      " "                  " "             
5  ( 1 )  "*"                      " "                  " "             
6  ( 1 )  "*"                      " "                  " "             
7  ( 1 )  " "                      " "                  " "             
8  ( 1 )  " "                      " "                  " "             
9  ( 1 )  " "                      " "                  " "             
10  ( 1 ) " "                      " "                  " "             
11  ( 1 ) " "                      " "                  " "             
12  ( 1 ) " "                      " "                  " "             
13  ( 1 ) "*"                      " "                  " "             
14  ( 1 ) " "                      " "                  " "             
15  ( 1 ) " "                      " "                  " "             
16  ( 1 ) "*"                      "*"                  "*"             
17  ( 1 ) "*"                      "*"                  "*"             
18  ( 1 ) " "                      "*"                  " "             
19  ( 1 ) " "                      "*"                  " "             
20  ( 1 ) " "                      "*"                  "*"             
21  ( 1 ) "*"                      "*"                  "*"             
          period:hole_area_smooth period:eccentricity period:solidity
1  ( 1 )  " "                     " "                 " "            
2  ( 1 )  " "                     " "                 " "            
3  ( 1 )  " "                     " "                 " "            
4  ( 1 )  " "                     " "                 " "            
5  ( 1 )  " "                     " "                 " "            
6  ( 1 )  " "                     " "                 " "            
7  ( 1 )  " "                     " "                 " "            
8  ( 1 )  " "                     " "                 " "            
9  ( 1 )  " "                     " "                 " "            
10  ( 1 ) " "                     " "                 "*"            
11  ( 1 ) " "                     " "                 " "            
12  ( 1 ) " "                     " "                 "*"            
13  ( 1 ) " "                     " "                 " "            
14  ( 1 ) " "                     "*"                 "*"            
15  ( 1 ) " "                     "*"                 "*"            
16  ( 1 ) "*"                     " "                 " "            
17  ( 1 ) "*"                     "*"                 " "            
18  ( 1 ) " "                     "*"                 "*"            
19  ( 1 ) "*"                     "*"                 "*"            
20  ( 1 ) "*"                     "*"                 "*"            
21  ( 1 ) "*"                     "*"                 "*"            
          hole_area_smooth:eccentricity hole_area_smooth:solidity
1  ( 1 )  " "                           " "                      
2  ( 1 )  " "                           " "                      
3  ( 1 )  " "                           " "                      
4  ( 1 )  "*"                           " "                      
5  ( 1 )  "*"                           "*"                      
6  ( 1 )  "*"                           "*"                      
7  ( 1 )  "*"                           "*"                      
8  ( 1 )  "*"                           "*"                      
9  ( 1 )  "*"                           "*"                      
10  ( 1 ) "*"                           "*"                      
11  ( 1 ) "*"                           "*"                      
12  ( 1 ) "*"                           "*"                      
13  ( 1 ) " "                           " "                      
14  ( 1 ) "*"                           "*"                      
15  ( 1 ) "*"                           "*"                      
16  ( 1 ) " "                           " "                      
17  ( 1 ) " "                           " "                      
18  ( 1 ) "*"                           "*"                      
19  ( 1 ) "*"                           "*"                      
20  ( 1 ) "*"                           "*"                      
21  ( 1 ) "*"                           "*"                      
          eccentricity:solidity
1  ( 1 )  " "                  
2  ( 1 )  " "                  
3  ( 1 )  " "                  
4  ( 1 )  " "                  
5  ( 1 )  " "                  
6  ( 1 )  "*"                  
7  ( 1 )  " "                  
8  ( 1 )  " "                  
9  ( 1 )  " "                  
10  ( 1 ) "*"                  
11  ( 1 ) " "                  
12  ( 1 ) " "                  
13  ( 1 ) " "                  
14  ( 1 ) "*"                  
15  ( 1 ) "*"                  
16  ( 1 ) " "                  
17  ( 1 ) " "                  
18  ( 1 ) "*"                  
19  ( 1 ) "*"                  
20  ( 1 ) "*"                  
21  ( 1 ) "*"

summary_leap_square$bic

 [1]  -63.54805 -174.54963 -224.02449 -302.06565 -334.92085 -408.11345
 [7] -427.22907 -505.92208 -543.77321 -573.70956 -579.42099 -580.28656
[13] -310.60661 -580.22657 -577.98218 -290.76927 -283.17979 -560.79231
[19] -554.37669 -546.47464 -538.73503

I.e. the interaction model with the lowest bic is

which(summary_leap_square$bic==min(summary_leap_square$bic))

[1] 12

We extract the variables of this model and put it into lm

interaction_str <- summary_leap_square$outmat[12, ][summary_leap_square$outmat[12, ]=='*'] %>% names %>% paste(., collapse='+') %>% paste0('dAdt~','',.)
linmodel_interaction_leaps <- lm(as.formula(interaction_str), data=pancrea_holes_1)

summary(linmodel_interaction_leaps)


Call:
lm(formula = as.formula(interaction_str), data = pancrea_holes_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-211.174  -18.369    2.351   18.954  174.924 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -2.245e+02  3.058e+01  -7.343 2.67e-13 ***
numCon                        -1.090e+02  1.578e+01  -6.909 5.93e-12 ***
numEnds                        2.507e+01  4.147e+00   6.044 1.68e-09 ***
hole_area_smooth               2.380e-01  1.338e-02  17.784  < 2e-16 ***
eccentricity                   3.067e+02  2.980e+01  10.291  < 2e-16 ***
numCon:numEnds                -8.939e+00  9.370e-01  -9.540  < 2e-16 ***
numCon:period                  1.298e+00  2.413e-01   5.381 7.97e-08 ***
numCon:hole_area_smooth       -4.907e-03  3.926e-04 -12.498  < 2e-16 ***
numCon:eccentricity           -4.422e+01  7.469e+00  -5.921 3.55e-09 ***
numCon:solidity                1.471e+02  1.162e+01  12.655  < 2e-16 ***
period:solidity               -4.769e+00  9.869e-01  -4.832 1.42e-06 ***
hole_area_smooth:eccentricity -7.513e-02  4.814e-03 -15.606  < 2e-16 ***
hole_area_smooth:solidity     -1.761e-01  1.105e-02 -15.932  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.67 on 3057 degrees of freedom
Multiple R-squared:  0.1999,    Adjusted R-squared:  0.1968 
F-statistic: 63.65 on 12 and 3057 DF,  p-value: < 2.2e-16

AIC(linmodel_interaction_leaps)

[1] 31007.89

BIC(linmodel_interaction_leaps)

[1] 31092.3

plot(linmodel_interaction_leaps)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

Version	Author	Date
dc6711c	githubz0r	2020-02-28

Yeah, improved adjusted R squared fo sho.

Btw, which rows were removed by na.omit?

removed_rows_bool <- pancrea_holes_all[names(pancrea_holes_all)!='hole_area_raw'] %>% apply(1, function(row){
  return((row %>% sapply(function(x){is.na(x)}) %>% sum())>0)
})
removed_rows <- pancrea_holes_all[removed_rows_bool, ]
table(removed_rows$hole_num_rep)


 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 20  1 11  1  1 
26 27 28 29 30 31 32 
 1  1  1  1  1  1  1

table(pancrea_holes_all$hole_num_rep) # 21 and 23 are completely removed by the filtering due to NaNs in period


  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
 21  20  55  13 160  56  67 178  74  34  41  58 226 226 224 203  46  25 
 19  20  21  22  23  24  25  26  27  28  29  30  31  32 
226  45  20 171  11 121 161 156 131  60  19  90 121  72

Check similarity between me and siljas counts, using zero-one loss and squared loss.

Zeroone <- function(v1, v2){
  return(1-sum(v1==v2)/length(v1))
}

Squarederror <- function(v1, v2){
  sq.loss <- sum((v1-v2)^2)/length(v1)
  return(sq.loss)
}

Squarederror(split(pancrea_holes_1$numEnds, pancrea_holes_1$hole_num_rep)$`22`, split(pancrea_holes_2$numEnds, pancrea_holes_2$hole_num_rep)$`22`) # sanity check

[1] 0.2588235

CalcError <- function(df1, df2, vectors.to.split, split.by='hole_num_rep'){
  calcerrors <- function(v1, v2){
    return(c(zeroone=Zeroone(v1, v2), squarederror=Squarederror(v1, v2)))
  }
  sub.dfs.by.factor <- vectors.to.split %>% lapply(function(vector.name){
    df1.split <- df1[[vector.name]] %>% split(df1[[split.by]])
    df2.split <- df2[[vector.name]] %>% split(df2[[split.by]])
    return(list(df1.split = df1.split, df2.split = df2.split))
  })
  sub.dfs.by.factor %<>% setNames(vectors.to.split)
  errors.by.factor <- sub.dfs.by.factor %>% lapply(function(data.list){
    errors.by.split <- Map(calcerrors, data.list[[1]], data.list[[2]])
    return(errors.by.split)
  })
  errors.by.factor <- errors.by.factor %>% lapply(function(x){do.call(rbind,x)}) %>% lapply(as.data.frame) %>% lapply(tibble::rownames_to_column, var='hole')
  errors.df <- dplyr::bind_rows(errors.by.factor, .id='compared_factor')
  average.errors <- vectors.to.split %>% sapply(function(vector.name){
    return(calcerrors(df1[[vector.name]], df2[[vector.name]]))
  })
  return(list(errors.df=errors.df, average.errors=average.errors))
}

error_pancrea <- CalcError(pancrea_holes_1, pancrea_holes_2, c('numCon', 'numEnds'))
error_pancrea$errors.df %>% ggplot((aes(x=as.factor(as.numeric(hole)), y=zeroone, col=compared_factor)))+
  theme(axis.text.x = element_text(angle = -90, hjust = 1))+geom_jitter()

error_pancrea$errors.df %>% ggplot((aes(x=as.factor(as.numeric(hole)), y=squarederror, col=compared_factor)))+
  theme(axis.text.x = element_text(angle = -90, hjust = 1))+geom_jitter()

quick plot of fitted vs real values for the various models

plot(linmodel_1$fitted.values~pancrea_holes_1$dAdt)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

plot(step_mass_squared_bi$fitted.values~pancrea_holes_1$dAdt)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

plot(linmodel_interaction_leaps$fitted.values~pancrea_holes_1$dAdt)

Version	Author	Date
dc6711c	githubz0r	2020-02-28

Let’s also just check the correlations

cor(linmodel_1$fitted.values, pancrea_holes_1$dAdt)

[1] 0.2692386

cor(step_mass_squared_bi$fitted.values, pancrea_holes_1$dAdt)

[1] 0.455347

cor(linmodel_interaction_leaps$fitted.values, pancrea_holes_1$dAdt)

[1] 0.4471024

We can also try to down sample to circumvent IID assumption violation

Sampleintervals <- function(df, interval.length, offset=0){
  start.from <- 1+offset
  n <- nrow(df)
  if(start.from >= n || interval.length >= n){
    indices <- sample(n, 1)
  } else {
    indices <- seq(start.from, n, interval.length)
  }
  return(df[indices, ])
}

Makedownsampled <- function(df, interval.length, split.by='hole_num_rep', offset=0){
  sub.dfs <- df %>% split(df[[split.by]])
  downsampled.dfs <- sub.dfs %>% lapply(function(x){return(Sampleintervals(x, interval.length=interval.length, offset=offset))})
  downsampled.rbound <- dplyr::bind_rows(downsampled.dfs)
  return(downsampled.rbound)
}

downsampled_df <- Makedownsampled(pancrea_holes_1, 5, 'hole_num_rep')
linmodel_downsampled <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = downsampled_df)

plot(linmodel_downsampled)

summary(linmodel_downsampled)


Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = downsampled_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-209.518  -18.620    2.285   20.124  164.512 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -5.874e+01  5.522e+01  -1.064 0.287865    
numCon           -5.772e+00  1.651e+00  -3.497 0.000505 ***
numEnds          -6.455e+00  1.800e+00  -3.585 0.000364 ***
period            2.195e-01  3.827e-01   0.574 0.566391    
hole_area_smooth  4.172e-03  9.448e-04   4.415 1.19e-05 ***
eccentricity     -2.612e+01  1.734e+01  -1.506 0.132506    
solidity          8.891e+01  5.073e+01   1.753 0.080172 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.23 on 614 degrees of freedom
Multiple R-squared:  0.07018,   Adjusted R-squared:  0.06109 
F-statistic: 7.724 on 6 and 614 DF,  p-value: 5.08e-08

Things look less fucked, but we have less statistical power now. Quickly check pred vs true plot and correlation. The results are less nighmarish looking but clearly linear model is insufficient.

plot(linmodel_downsampled$fitted.values~downsampled_df$dAdt)

cor(linmodel_downsampled$fitted.values, downsampled_df$dAdt)

[1] 0.2649125

We can try different offsets during the sampling (sampling 1 random point from the hole if offset >= n_hole)

offsets_ <- c(0, 5, 10, 20, 30, 50)
dfs_offsets <- offsets_ %>% lapply(function(x, interval.length=20){
  df.downsampled <- Makedownsampled(pancrea_holes_1, interval.length, offset=x)
  return(df.downsampled)
})

linmodels_offsets <- dfs_offsets %>% lapply(function(x){
  lm.downsampled <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = x)
  return(lm.downsampled)
})

offset_residplots <- linmodels_offsets %>% lapply(function(x){
  return(plot(x, which=c(1,1)))
})

summaries of the models

linmodels_offsets %>% lapply(summary)

[[1]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-195.270  -22.183    1.649   19.861  162.796 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -90.932126 108.098333  -0.841   0.4015  
numCon            -5.540483   3.361793  -1.648   0.1013  
numEnds           -6.427740   3.680218  -1.747   0.0827 .
period             0.584798   0.771029   0.758   0.4493  
hole_area_smooth   0.003776   0.001892   1.996   0.0477 *
eccentricity     -13.593303  32.815545  -0.414   0.6793  
solidity         106.316779  99.522407   1.068   0.2870  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.16 on 158 degrees of freedom
Multiple R-squared:  0.05984,   Adjusted R-squared:  0.02414 
F-statistic: 1.676 on 6 and 158 DF,  p-value: 0.1301


[[2]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-200.164  -18.242    4.274   19.387  157.314 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -62.802756 113.489165  -0.553   0.5808  
numCon            -5.254627   3.438982  -1.528   0.1286  
numEnds           -7.207884   3.696919  -1.950   0.0531 .
period             0.143863   0.777754   0.185   0.8535  
hole_area_smooth   0.004343   0.001968   2.207   0.0288 *
eccentricity     -15.161384  35.997673  -0.421   0.6742  
solidity          83.901839 103.977441   0.807   0.4210  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.56 on 150 degrees of freedom
Multiple R-squared:  0.06591,   Adjusted R-squared:  0.02855 
F-statistic: 1.764 on 6 and 150 DF,  p-value: 0.1102


[[3]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-209.591  -17.245    4.481   23.532  117.249 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -49.055532 116.705140  -0.420   0.6749  
numCon            -5.861485   3.435513  -1.706   0.0901 .
numEnds           -6.900645   3.712004  -1.859   0.0650 .
period            -0.189559   0.775461  -0.244   0.8072  
hole_area_smooth   0.004744   0.001994   2.380   0.0186 *
eccentricity     -30.775003  37.341085  -0.824   0.4112  
solidity          91.384664 106.776636   0.856   0.3935  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.39 on 146 degrees of freedom
Multiple R-squared:  0.08217,   Adjusted R-squared:  0.04445 
F-statistic: 2.178 on 6 and 146 DF,  p-value: 0.04827


[[4]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-188.287  -22.471    0.868   21.522  156.187 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -52.115533 133.680267  -0.390   0.6973  
numCon            -6.010118   3.586520  -1.676   0.0962 .
numEnds           -7.623049   4.074120  -1.871   0.0635 .
period             0.784561   0.865185   0.907   0.3662  
hole_area_smooth   0.003699   0.002089   1.771   0.0789 .
eccentricity     -64.276361  44.146099  -1.456   0.1478  
solidity         104.513321 120.720899   0.866   0.3882  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.43 on 132 degrees of freedom
Multiple R-squared:  0.08692,   Adjusted R-squared:  0.04541 
F-statistic: 2.094 on 6 and 132 DF,  p-value: 0.05802


[[5]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-205.097  -16.333    3.551   22.982  120.209 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -66.296600 138.709979  -0.478   0.6335  
numCon            -5.189165   3.731084  -1.391   0.1668  
numEnds           -7.252381   4.233679  -1.713   0.0893 .
period            -0.619042   0.894353  -0.692   0.4902  
hole_area_smooth   0.004042   0.002229   1.813   0.0723 .
eccentricity     -24.018157  46.593588  -0.515   0.6072  
solidity         113.678419 126.253800   0.900   0.3697  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.48 on 121 degrees of freedom
Multiple R-squared:  0.07717,   Adjusted R-squared:  0.03141 
F-statistic: 1.686 on 6 and 121 DF,  p-value: 0.13


[[6]]

Call:
lm(formula = dAdt ~ numCon + numEnds + period + hole_area_smooth + 
    eccentricity + solidity, data = x)

Residuals:
     Min       1Q   Median       3Q      Max 
-195.112  -14.312    1.228   20.540  106.958 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -1.151e+02  1.498e+02  -0.769    0.444
numCon           -5.950e+00  3.987e+00  -1.492    0.139
numEnds          -6.207e+00  4.696e+00  -1.322    0.189
period           -3.459e-01  1.013e+00  -0.342    0.733
hole_area_smooth  2.918e-03  2.397e-03   1.218    0.226
eccentricity      1.056e+01  4.919e+01   0.215    0.830
solidity          1.323e+02  1.382e+02   0.957    0.341

Residual standard error: 42.49 on 100 degrees of freedom
Multiple R-squared:  0.06367,   Adjusted R-squared:  0.007489 
F-statistic: 1.133 on 6 and 100 DF,  p-value: 0.3486

just a quick sanity test because I forgot why permuting doesn’t change linear regression results, but ofc it’s because permutation matrix cancels out in the formula

quicktest <- lm(dAdt~numCon + numEnds + period + hole_area_smooth + eccentricity + solidity, data = pancrea_holes_1[sample(1:nrow(pancrea_holes_1)), ])

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] leaps_3.1       MASS_7.3-51.5   cowplot_1.0.0   magrittr_1.5   
 [5] irlba_2.3.3     Matrix_1.2-17   forcats_0.4.0   stringr_1.4.0  
 [9] dplyr_0.8.3     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
[13] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       lubridate_1.7.4  lattice_0.20-38  assertthat_0.2.1
 [5] rprojroot_1.3-2  digest_0.6.21    R6_2.4.0         cellranger_1.1.0
 [9] plyr_1.8.4       backports_1.1.4  reprex_0.3.0     evaluate_0.14   
[13] httr_1.4.1       pillar_1.4.2     rlang_0.4.4      lazyeval_0.2.2  
[17] readxl_1.3.1     rstudioapi_0.10  whisker_0.4      rmarkdown_1.15  
[21] labeling_0.3     munsell_0.5.0    broom_0.5.2      compiler_3.6.1  
[25] httpuv_1.5.2     modelr_0.1.5     xfun_0.9         pkgconfig_2.0.3 
[29] htmltools_0.3.6  tidyselect_0.2.5 workflowr_1.6.0  crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.1.2      later_0.8.0      grid_3.6.1      
[37] nlme_3.1-141     jsonlite_1.6     gtable_0.3.0     lifecycle_0.1.0 
[41] DBI_1.0.0        git2r_0.26.1     scales_1.0.0     cli_1.1.0       
[45] stringi_1.4.3    reshape2_1.4.3   fs_1.3.1         promises_1.0.1  
[49] xml2_1.2.2       generics_0.0.2   vctrs_0.2.2      tools_3.6.1     
[53] glue_1.3.1       hms_0.5.3        yaml_2.2.0       colorspace_1.4-1
[57] rvest_0.3.5      knitr_1.25       haven_2.2.0

first_steps

Githubz0r

2/13/2020

preparation

analysis