Title: | Proteomics Data Analysis and Modeling Tools |
---|---|
Description: | A comprehensive, user-friendly package for label-free proteomics data analysis and machine learning-based modeling. Data generated from 'MaxQuant' can be easily used to conduct differential expression analysis, build predictive models with top protein candidates, and assess model performance. promor includes a suite of tools for quality control, visualization, missing data imputation (Lazar et al. (2016) <doi:10.1021/acs.jproteome.5b00981>), differential expression analysis (Ritchie et al. (2015) <doi:10.1093/nar/gkv007>), and machine learning-based modeling (Kuhn (2008) <doi:10.18637/jss.v028.i05>). |
Authors: | Chathurani Ranathunge [aut, cre, cph] |
Maintainer: | Chathurani Ranathunge <[email protected]> |
License: | LGPL (>= 2.1) |
Version: | 0.2.1 |
Built: | 2024-11-08 04:29:19 UTC |
Source: | https://github.com/caranathunge/promor |
This function computes average intensities across technical replicates for each sample.
aver_techreps(raw_df)
raw_df |
A |
aver_techreps assumes that column names in the data frame follow the "Group_UniqueSampleID_TechnicalReplicate" notation. (Use head(raw_df) to see the structure of the raw_df object.)
A raw_df object of averaged intensities.
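The averaging step can be illustrated in base R. This is a minimal sketch, not the package's implementation: the toy matrix, the regular expression used to strip the replicate suffix, and the choice to ignore NAs when averaging are all assumptions made for illustration.

```r
# Toy log-intensity matrix: 2 samples, each with 2 technical replicates.
# Column names follow "Group_UniqueSampleID_TechnicalReplicate".
mat <- matrix(
  c(20, 22, 24, 26,
    10, NA, 14, 16),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("protA", "protB"),
    c("H_S1_1", "H_S1_2", "L_S2_1", "L_S2_2")
  )
)

# Drop the trailing "_<replicate>" to obtain one key per sample.
sample_id <- sub("_[^_]+$", "", colnames(mat))

# Average the replicate columns of each sample.
# Assumption: NAs are ignored when averaging.
ave <- sapply(unique(sample_id), function(s) {
  rowMeans(mat[, sample_id == s, drop = FALSE], na.rm = TRUE)
})
print(ave)
```

The result has one column per sample ("H_S1", "L_S2") instead of one column per replicate.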
Chathurani Ranathunge
## Use a data set containing technical replicates to create a raw_df object
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  tech_reps = TRUE
)

# Compute average intensities across technical replicates.
rawdf_ave <- aver_techreps(raw_df)
This function generates scatter plots to visualize the correlation between a given pair of technical replicates (Eg: 1 vs 2) for each sample.
corr_plot(
  raw_df,
  rep_1,
  rep_2,
  save = FALSE,
  file_type = "pdf",
  palette = "viridis",
  text_size = 5,
  n_row = 4,
  n_col = 4,
  dpi = 80,
  file_path = NULL
)
raw_df |
A |
rep_1 |
Numerical. Technical replicate number. |
rep_2 |
Numerical. Number of the second technical replicate to compare
to |
save |
Logical. If |
file_type |
File type to save the scatter plots. Default is "pdf". |
palette |
Viridis color palette option for plots. Default is "viridis". |
text_size |
Text size for plot labels, axis labels etc. Default is 5. |
n_row |
Numerical. Number of plots to print in a row in a single page. Default is 4. |
n_col |
Numerical. Number of plots to print in a column in a single page. Default is 4. |
dpi |
Plot resolution. Default is 80. |
file_path |
A string containing the directory path to save the file. |
Given a data frame of log-transformed intensities (a raw_df object) and a pair of numbers referring to the technical replicates, corr_plot produces a list of scatter plots showing correlation between the given pair of technical replicates for all the samples provided in the data frame.
Note: n_row * n_col should be equal to the number of samples to display in a single page.
A list of ggplot2 plot objects.
Chathurani Ranathunge
create_df
## Use a data set containing technical replicates to create a raw_df object
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  tech_reps = TRUE
)

## Compare technical replicates 1 vs. 2 for all samples
corr_plot(raw_df, rep_1 = 1, rep_2 = 2)
An object of class "MArrayLM" from running find_dep on covid_norm_df
data(covid_fit_df)
An object of class "MArrayLM"
https://www.frontiersin.org/articles/10.3389/fphys.2021.652799/full#h3
A dataframe containing normalized LFQ protein intensity data for 230 proteins in 35 samples (a subset of the original data set)
data(covid_norm_df)
A data frame with 230 rows (proteins) and 35 columns (samples)
https://www.frontiersin.org/articles/10.3389/fphys.2021.652799/full#h3
This function creates a data frame of protein intensities.
create_df(
  prot_groups,
  exp_design,
  input_type = "MaxQuant",
  data_type = "LFQ",
  filter_na = TRUE,
  filter_prot = TRUE,
  uniq_pep = 2,
  tech_reps = FALSE,
  zero_na = TRUE,
  log_tr = TRUE,
  base = 2
)
prot_groups |
File path to a proteinGroups.txt file produced by MaxQuant or a standard input file containing a quantitative matrix where the proteins or protein groups are indicated by rows and the samples by columns. |
exp_design |
File path to a text file containing the experimental design. |
input_type |
Type of input file indicated by |
data_type |
Type of sample protein intensity data columns to use from the proteinGroups.txt file. Some available options are "LFQ", "iBAQ", "Intensity". Default is "LFQ". User-defined prefixes in the proteinGroups.txt file are also allowed. The |
filter_na |
Logical. If |
filter_prot |
Logical. If |
uniq_pep |
Numerical. Proteins that are identified by this number of unique peptides or fewer are filtered out (default is 2). Only applies when |
tech_reps |
Logical. Indicate as |
zero_na |
Logical. If |
log_tr |
Logical. If |
base |
Numerical. Logarithm base. Default is 2. |
This function first reads in the proteinGroups.txt file produced by MaxQuant or a standard input file containing a quantitative matrix where the proteins or protein groups are indicated by rows and the samples by columns.
It then reads in the expDesign.txt file provided as exp_design and extracts relevant information from it to add to the data frame. An example of the expDesign.txt file is provided here: https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt.
First, empty rows and columns are removed from the data frame.
Next, if a proteinGroups.txt file is used, it filters out reverse proteins, proteins that were only identified by site, and potential contaminants. Then it removes proteins identified with fewer than the number of unique peptides indicated by uniq_pep from the data frame.
Next, it extracts the intensity columns indicated by data_type and the selected protein rows from the data frame.
It then converts missing values (zeros) to NAs.
Finally, the function log transforms the intensity values.
A raw_df object, which is a data frame containing protein intensities. Proteins or protein groups are indicated by rows and samples by columns.
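The last two steps described above (zeros to NAs, then log transformation) can be sketched in base R. This is an illustration under stated assumptions rather than the code create_df runs: the toy matrix and its row and column names are hypothetical, and the filtering steps are omitted.

```r
# Hypothetical raw intensity matrix (proteins in rows, samples in columns).
intens <- matrix(
  c(1024, 2048,
       0,  512,
    4096,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("protA", "protB", "protC"), c("H_1", "L_1"))
)

# Step corresponding to zero_na = TRUE: treat zeros as missing values.
intens[intens == 0] <- NA

# Step corresponding to log_tr = TRUE with base = 2.
raw <- log(intens, base = 2)
print(raw)
```

Converting zeros to NA before the transformation avoids -Inf values, which log(0) would otherwise produce.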
Chathurani Ranathunge
### Using a proteinGroups.txt file produced by MaxQuant as input.
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant"
)

## Data containing technical replicates
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  input_type = "MaxQuant",
  tech_reps = TRUE
)

## Alter the number of unique peptides needed to retain a protein
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant",
  uniq_pep = 1
)

## Use "iBAQ" values instead of "LFQ" values
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant",
  data_type = "iBAQ"
)

### Using a universal standard input file instead of MaxQuant output.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/st.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "standard"
)
An object of class "MArrayLM" from running find_dep on ecoli_norm_df
data(ecoli_fit_df)
An object of class "MArrayLM"
https://europepmc.org/article/MED/24942700#id609082
A dataframe containing normalized LFQ protein intensity data for 4360 proteins in 6 samples
data(ecoli_norm_df)
A data frame with 4360 rows (proteins) and 6 columns (samples)
https://europepmc.org/article/MED/24942700#id609082
This function visualizes protein intensity differences among conditions (classes) using box plots or density distribution plots.
feature_plot(
  model_df,
  type = "box",
  text_size = 10,
  palette = "viridis",
  n_row,
  n_col,
  save = FALSE,
  file_path = NULL,
  file_name = "Feature_plot",
  file_type = "pdf",
  dpi = 80,
  plot_width = 7,
  plot_height = 7
)
model_df |
A |
type |
Type of plot to generate. Choices are "box" or "density". Default is "box". |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
n_row |
Number of rows to print the plots. |
n_col |
Number of columns to print the plots. |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "Feature_plot". |
file_type |
File type to save the plot. Default is "pdf". |
dpi |
Plot resolution. Default is 80. |
plot_width |
Width of the plot. Default is 7. |
plot_height |
Height of the plot. Default is 7. |
This function visualizes condition-wise differences in protein intensity using box plots or density plots.
A ggplot2 object.
Chathurani Ranathunge
pre_process, rem_feature
## Create a model_df object with default settings.
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Feature variation - box plots
feature_plot(covid_model_df, type = "box", n_row = 4, n_col = 2)

## Density plots
feature_plot(covid_model_df, type = "density")

## Change color palette
feature_plot(covid_model_df,
  type = "density",
  n_row = 4,
  n_col = 2,
  palette = "rocket"
)
This function filters out proteins based on missing data at the group level.
filterbygroup_na(raw_df, set_na = 0.34, filter_condition = "either")
raw_df |
A |
set_na |
The proportion of missing data allowed. Default is 0.34 (one third of the samples in the group). |
filter_condition |
If set to |
This function first extracts group or condition information from the raw_df object and assigns samples to their groups.
If filter_condition = "each", it then removes proteins (rows) from the data frame if the proportion of NAs in each group exceeds the threshold indicated by set_na (default is 0.34). This option is more lenient than filter_condition = "either", where proteins that exceed the missing data threshold in either group are removed from the data frame.
A raw_df object.
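The difference between the two filter conditions can be seen on a toy matrix. The sketch below is a base R illustration of the logic as described in the text, not the package's code; the matrix, the way groups are parsed from column names, and the variable names are assumptions.

```r
# Toy matrix: 3 proteins, 2 groups (H and L) with 3 samples each.
mat <- matrix(
  c( 1,  1, 1,  1,  1, 1,   # protA: no missing values
    NA, NA, 1,  1,  1, 1,   # protB: 2/3 missing in group H only
    NA, NA, 1, NA, NA, 1),  # protC: 2/3 missing in both groups
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("protA", "protB", "protC"),
    c("H_1", "H_2", "H_3", "L_1", "L_2", "L_3")
  )
)

# Assumption: the group is the part of the column name before the first "_".
group <- sub("_.*$", "", colnames(mat))
set_na <- 0.34

# Proportion of NAs per protein within each group.
na_prop <- sapply(unique(group), function(g) {
  rowMeans(is.na(mat[, group == g, drop = FALSE]))
})
exceeds <- na_prop > set_na

# "either": drop a protein if ANY group exceeds the threshold.
keep_either <- !apply(exceeds, 1, any)
# "each": drop a protein only if ALL groups exceed the threshold.
keep_each <- !apply(exceeds, 1, all)

print(rownames(mat)[keep_either])
print(rownames(mat)[keep_each])
```

Here protB survives under "each" but not under "either", which is why "each" is the more lenient option.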
Chathurani Ranathunge
# Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)

## Remove proteins that exceed 34% NAs in either group (default)
rawdf_filt1 <- filterbygroup_na(raw_df)

## Remove proteins that exceed 34% NAs in each group
rawdf_filt2 <- filterbygroup_na(raw_df, filter_condition = "each")

## Proportion of samples with NAs allowed in each group = 0.5
rawdf_filt3 <- filterbygroup_na(raw_df, set_na = 0.5, filter_condition = "each")
This function performs differential expression analysis on protein intensity data with limma.
find_dep(
  df,
  save_output = FALSE,
  save_tophits = FALSE,
  file_path = NULL,
  adj_method = "BH",
  cutoff = 0.05,
  lfc = 1,
  n_top = 20
)
df |
A |
save_output |
Logical. If |
save_tophits |
Logical. If |
file_path |
A string containing the directory path to save the file. |
adj_method |
Method used for adjusting the p-values for multiple testing. Default is "BH". |
cutoff |
Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc |
Minimum absolute log2-fold change to use as threshold for differential expression. Default is 1. |
n_top |
The number of top differentially expressed proteins to save in the "TopHits.txt" file. Default is 20. |
The data should be log-transformed and, ideally, imputed and normalized before performing differential expression analysis.
save_output saves the complete results table from the differential expression analysis.
save_tophits first subsets the results to those with an absolute log fold change of more than 1, performs multiple testing correction with the method specified in adj_method, and outputs the top n_top results based on lowest p-value and adjusted p-value.
If the number of hits with an absolute log fold change of more than 1 is less than n_top, find_dep prints only those with an absolute log fold change of more than 1 to "TopHits.txt".
If file_path is not specified, text files will be saved in a temporary directory.
A fit_df object, which is similar to a limma fit object.
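The top-hits selection described above can be sketched with base R alone. The results table below is made up for illustration, the column names are hypothetical, and the steps simply follow the order given in the text; how find_dep implements this internally is not shown here.

```r
# Hypothetical results table (one row per protein).
res <- data.frame(
  protein = paste0("P", 1:6),
  logFC   = c(2.5, -1.8, 0.4, 1.2, -0.9, 3.1),
  P.Value = c(0.001, 0.004, 0.0005, 0.03, 0.2, 0.02)
)

lfc <- 1
n_top <- 3
adj_method <- "BH"

# 1. Keep proteins whose absolute log fold change exceeds the threshold.
hits <- res[abs(res$logFC) > lfc, ]

# 2. Adjust p-values for multiple testing (p.adjust is part of base R's stats).
hits$adj.P.Val <- p.adjust(hits$P.Value, method = adj_method)

# 3. Rank by p-value and keep at most n_top rows.
hits <- hits[order(hits$P.Value), ]
top_hits <- head(hits, n_top)
print(top_hits)
```

Note that P3 has the smallest p-value but is excluded because its fold change is below the threshold. If fewer than n_top proteins pass the fold-change filter, head() returns only those that do.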
Chathurani Ranathunge
Ritchie, Matthew E., et al. "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic acids research 43.7 (2015): e47-e47.
## Perform differential expression analysis using default settings
fit_df1 <- find_dep(ecoli_norm_df)

## Change p-value and adjusted p-value cutoff
fit_df2 <- find_dep(ecoli_norm_df, cutoff = 0.1)
This function generates a heatmap to visualize differentially expressed proteins between groups.
heatmap_de(
  fit_df,
  df,
  adj_method = "BH",
  cutoff = 0.05,
  lfc = 1,
  sig = "adjP",
  n_top = 20,
  palette = "viridis",
  text_size = 10,
  save = FALSE,
  file_path = NULL,
  file_name = "HeatmapDE",
  file_type = "pdf",
  dpi = 80,
  plot_height = 7,
  plot_width = 7
)
fit_df |
A |
df |
The |
adj_method |
Method used for adjusting the p-values for multiple testing. Default is "BH". |
cutoff |
Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc |
Minimum absolute log2-fold change to use as threshold for differential expression. Default is 1. |
sig |
Criteria to denote significance. Choices are |
n_top |
Number of top hits to include in the heat map. |
palette |
Viridis color palette option for plots. Default is "viridis". |
text_size |
Text size for axis text, labels etc. |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "HeatmapDE". |
file_type |
File type to save the plot. Default is "pdf". |
dpi |
Plot resolution. Default is 80. |
plot_height |
Height of the plot. Default is 7. |
plot_width |
Width of the plot. Default is 7. |
By default the tiles in the heatmap are reordered by intensity values along both axes (x axis = samples, y axis = proteins).
A ggplot2 plot object.
Chathurani Ranathunge
## Build a heatmap of differentially expressed proteins using the provided
## example fit_df and norm_df data objects
heatmap_de(covid_fit_df, covid_norm_df)

## Create a heatmap with P-value of 0.05 and log fold change of 1 as
## significance criteria.
heatmap_de(covid_fit_df, covid_norm_df, cutoff = 0.05, sig = "P")

## Visualize the top 30 differentially expressed proteins in the heatmap and
## change the color palette
heatmap_de(covid_fit_df, covid_norm_df,
  cutoff = 0.05,
  sig = "P",
  n_top = 30,
  palette = "magma"
)
This function visualizes the patterns of missing value occurrence using a heatmap.
heatmap_na(
  raw_df,
  protein_range,
  sample_range,
  reorder_x = FALSE,
  reorder_y = FALSE,
  x_fun = mean,
  y_fun = mean,
  palette = "viridis",
  label_proteins = FALSE,
  text_size = 10,
  save = FALSE,
  file_type = "pdf",
  file_path = NULL,
  file_name = "Missing_data_heatmap",
  plot_width = 15,
  plot_height = 15,
  dpi = 80
)
raw_df |
A |
protein_range |
The range or subset of proteins (rows) to plot. If not provided, all the proteins (rows) in the data frame will be used. |
sample_range |
The range of samples to plot. If not provided, all the samples (columns) in the data frame will be used. |
reorder_x |
Logical. If |
reorder_y |
Logical. If |
x_fun |
Function to reorder samples along the x axis. Possible options
are |
y_fun |
Function to reorder proteins along the y axis. Possible options
are |
palette |
Viridis color palette option for plots. Default is "viridis". |
label_proteins |
If |
text_size |
Text size for axis labels. Default is 10. |
save |
Logical. If |
file_type |
File type to save the heatmap. Default is "pdf". |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the heatmap. Default is "Missing_data_heatmap". |
plot_width |
Width of the plot. Default is 15. |
plot_height |
Height of the plot. Default is 15. |
dpi |
Plot resolution. Default is 80. |
This function visualizes patterns of missing value occurrence using a heatmap. The user can choose to reorder the axes using the available functions (x_fun, y_fun) to better understand the underlying cause of missing data.
A ggplot2 plot object.
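The idea of reordering axes by a summary of protein intensity can be sketched in base R without any plotting. This is a hedged illustration: the toy matrix is made up, and sorting in ascending order with NAs ignored is an assumption about how the reordering behaves.

```r
# Toy log-intensity matrix with missing values.
mat <- matrix(
  c(20, NA, 22,
    NA, NA, 18,
    25, 24, 26),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("protA", "protB", "protC"), c("S1", "S2", "S3"))
)

# TRUE marks a missing value; this is what a missing-data heatmap displays.
miss <- is.na(mat)

# Reorder samples (x axis) and proteins (y axis) by mean intensity.
x_fun <- mean
y_fun <- mean
col_order <- order(apply(mat, 2, x_fun, na.rm = TRUE))
row_order <- order(apply(mat, 1, y_fun, na.rm = TRUE))

miss_ordered <- miss[row_order, col_order]
print(miss_ordered)
```

After reordering, low-intensity proteins sit together, so it becomes easier to see whether missing values concentrate among them.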
Chathurani Ranathunge
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)

## Missing data heatmap with default settings.
heatmap_na(raw_df)

## Missing data heatmap with x and y axes reordered by the mean (default) of
## protein intensity.
heatmap_na(raw_df,
  reorder_x = TRUE,
  reorder_y = TRUE
)

## Missing data heatmap with x and y axes reordered by the sum of
## protein intensity.
heatmap_na(raw_df,
  reorder_x = TRUE,
  reorder_y = TRUE,
  x_fun = sum,
  y_fun = sum
)

## Missing data heatmap for a subset of the proteins with x and y axes
## reordered by the mean (default) of protein intensity and the y axis
## labeled with protein IDs.
heatmap_na(raw_df,
  protein_range = 1:30,
  reorder_x = TRUE,
  reorder_y = TRUE,
  label_proteins = TRUE
)
This function imputes missing values using a user-specified imputation method.
impute_na(
  df,
  method = "minProb",
  tune_sigma = 1,
  q = 0.01,
  maxiter = 10,
  ntree = 20,
  n_pcs = 2,
  seed = NULL
)
df |
A |
method |
Imputation method to use. Default is "minProb". |
tune_sigma |
A scalar used in the |
q |
A scalar used in |
maxiter |
Maximum number of iterations to be performed when using the "RF" method. Default is 10. |
ntree |
Number of trees to grow in each forest when using the "RF" method. Default is 20. |
n_pcs |
Number of principal components to calculate when using the "SVD" method. Default is 2. |
seed |
Numerical. Random number seed. Default is NULL. |
Ideally, you should first remove proteins with high levels of missing data using the filterbygroup_na function before running impute_na on the raw_df object or the norm_df object.
The impute_na function imputes missing values using a user-specified imputation method from the available options: minProb, minDet, kNN, RF, and SVD.
Note: Some imputation methods may require that the data be normalized prior to imputation.
Make sure to fix the random number seed with seed for reproducibility.
An imp_df object, which is a data frame of protein intensities with no missing values.
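To give a feel for what a left-censored imputation does, here is a simplified base R stand-in inspired by the minDet idea: each missing value is replaced by a low quantile (q) of the observed values in the same sample. It is an assumption-laden sketch, not the imputeLCMD implementation that the package relies on, and the toy matrix is made up.

```r
# Toy log-intensity matrix (proteins in rows, samples in columns).
mat <- matrix(
  c(20, 21, NA, 23, 24,   # sample S1
    30, NA, 32, 33, 34),  # sample S2
  ncol = 2,
  dimnames = list(paste0("prot", 1:5), c("S1", "S2"))
)

q <- 0.01

# Replace each NA with the q-th quantile of the observed values in its column.
imp <- apply(mat, 2, function(x) {
  x[is.na(x)] <- quantile(x, probs = q, na.rm = TRUE)
  x
})
print(imp)
```

This particular sketch is deterministic, so no seed is needed. Methods that draw random values should be run with a fixed seed to obtain reproducible results.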
Chathurani Ranathunge
Lazar, Cosmin, et al. "Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies." Journal of proteome research 15.4 (2016): 1116-1125.
More information on the available imputation methods can be found in their respective packages.
For the minProb and minDet methods, see the imputeLCMD package.
For the Random Forest (RF) method, see missForest.
For the SVD method, see pca from the pcaMethods package.
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)

## Impute missing values in the data frame using the default minProb
## method.
imp_df1 <- impute_na(raw_df, seed = 3312)

## Impute using the RF method with the number of iterations set at 5
## and number of trees set at 100.
imp_df2 <- impute_na(raw_df,
  method = "RF",
  maxiter = 5,
  ntree = 100,
  seed = 3312
)

## Using the kNN method.
imp_df3 <- impute_na(raw_df, method = "kNN", seed = 3312)

## Using the SVD method with n_pcs set to 3.
imp_df4 <- impute_na(raw_df, method = "SVD", n_pcs = 3, seed = 3312)

## Using the minDet method with q set at 0.001.
imp_df5 <- impute_na(raw_df, method = "minDet", q = 0.001, seed = 3312)

## Impute a normalized data set using the kNN method
imp_df6 <- impute_na(ecoli_norm_df, method = "kNN")
This function generates density plots to visualize the impact of missing data imputation on the data.
impute_plot(
  original,
  imputed,
  global = TRUE,
  text_size = 10,
  palette = "viridis",
  n_row,
  n_col,
  save = FALSE,
  file_path = NULL,
  file_name = "Impute_plot",
  file_type = "pdf",
  plot_width = 7,
  plot_height = 7,
  dpi = 80
)
original |
A |
imputed |
An |
global |
Logical. If |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
n_row |
Used if |
n_col |
Used if |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the density plot/s. Default is "Impute_plot". |
file_type |
File type to save the density plot/s. Default is "pdf". |
plot_width |
Width of the plot. Default is 7. |
plot_height |
Height of the plot. Default is 7. |
dpi |
Plot resolution. Default is 80. |
Given two data frames, one with missing values and the other, an imputed data frame (imp_df object) of the same data set, impute_plot generates global or sample-wise density plots to visualize the impact of imputation on the data set.
Note: when the sample-wise option is selected (global = FALSE), n_col and n_row can be used to specify the number of columns and rows to print the plots.
If you choose to specify n_row and n_col, make sure that n_row * n_col matches the total number of samples in the data frame.
A ggplot2 plot object.
Chathurani Ranathunge
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)

## Impute missing values in the data frame using the default minProb
## method.
imp_df <- impute_na(raw_df)

## Visualize the impact of missing data imputation with a global density
## plot.
impute_plot(original = raw_df, imputed = imp_df)

## Make sample-wise density plots
impute_plot(raw_df, imp_df, global = FALSE)

## Print plots in user-specified numbers of rows and columns
impute_plot(raw_df, imp_df, global = FALSE, n_col = 2, n_row = 3)
This function visualizes the impact of normalization on the data.
norm_plot(
  original,
  normalized,
  type = "box",
  text_size = 10,
  palette = "viridis",
  save = FALSE,
  file_path = NULL,
  file_name = "Norm_plot",
  file_type = "pdf",
  dpi = 80,
  plot_width = 10,
  plot_height = 7
)
original |
A |
normalized |
A |
type |
Type of plot to generate. Choices are "box" or "density". Default is "box". |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "Norm_plot". |
file_type |
File type to save the plot. Default is "pdf". |
dpi |
Plot resolution. Default is 80. |
plot_width |
Width of the plot. Default is 10. |
plot_height |
Height of the plot. Default is 7. |
Given two data frames, one with data prior to normalization (original) and the other after normalization (normalized), norm_plot generates side-by-side plots to visualize the effect of normalization on the protein intensity data.
A ggplot2 plot object.
Chathurani Ranathunge
create_df
impute_na
## Generate a raw_df object with default settings. No technical replicates. raw_df <- create_df( prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt", exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt" ) ## Impute missing values in the data frame using the default minProb ## method. imp_df <- impute_na(raw_df) ## Normalize the imp_df object using the default quantile method norm_df <- normalize_data(imp_df) ## Visualize normalization using box plots norm_plot(original = imp_df, normalized = norm_df) ## Visualize normalization using density plots norm_plot(imp_df, norm_df, type = "density")
This function normalizes data using a user-specified normalization method.
normalize_data(df, method = "quantile")
df |
An |
method |
Name of the normalization method to use. Choices are
|
normalize_data is a wrapper function around the normalizeBetweenArrays function from the limma package.
This function normalizes intensity values to achieve consistency among samples.
It assumes that the intensities in the data frame have been log-transformed. It is therefore important to make sure that create_df was run with log_tr = TRUE (default) when creating the raw_df object.
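The idea behind the default "quantile" method can be illustrated with a short base R sketch. This is not the limma implementation, which treats ties and missing values in its own way; it assumes a complete, log-transformed matrix, and the matrix values and the helper name quantile_normalize are made up for illustration.

```r
# Toy log2 intensity matrix: rows are proteins, columns are samples.
# All values are invented for this sketch.
mat <- matrix(
  c(5, 2, 3, 4,
    4, 1, 4, 2,
    3, 4, 6, 8),
  nrow = 4,
  dimnames = list(paste0("prot", 1:4), c("A_1", "A_2", "B_1"))
)

# Hypothetical helper, not part of promor or limma.
quantile_normalize <- function(x) {
  # Rank the values within each sample (column).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # The row means of the column-wise sorted values form the
  # reference distribution shared by all samples.
  ref <- rowMeans(apply(x, 2, sort))
  # Replace each value with the reference value at its rank.
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

norm_mat <- quantile_normalize(mat)
norm_mat
```

After this step every sample has the same set of values, only assigned to different proteins, which is why box plots of quantile-normalized samples line up.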
A norm_df object, which is a data frame of normalized protein intensities.
Chathurani Ranathunge
create_df
impute_na
See normalizeBetweenArrays in the R package limma for more information on the different normalization methods available.
## Generate a raw_df object with default settings. No technical replicates. raw_df <- create_df( prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt", exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt" ) ## Impute missing values in the data frame using the default minProb ## method prior to normalization. imp_df <- impute_na(raw_df) ## Normalize the imp_df object using the default quantile method norm_df1 <- normalize_data(imp_df) ## Use the cyclicloess method norm_df2 <- normalize_data(imp_df, method = "cyclicloess") ## Normalize data in the raw_df object prior to imputation. norm_df3 <- normalize_data(raw_df)
This function outputs a list of proteins that are only expressed (present) in one user-specified group while not expressed (completely absent) in another user-specified group.
onegroup_only( raw_df, abs_group, pres_group, set_na = 0.34, save = FALSE, file_path = NULL )
raw_df |
A |
abs_group |
Name of the group in which proteins are not expressed. |
pres_group |
Name of the group in which proteins are expressed. |
set_na |
The percentage of missing data allowed in |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
Note: onegroup_only assumes that column names in the raw_df object follow the "Group_UniqueSampleID" notation. (Use head(raw_df) to check the structure of your raw_df object.)
Given a pair of groups, onegroup_only finds proteins that are expressed only in pres_group while completely absent (not expressed) in abs_group.
A text file containing majority protein IDs will be saved in a temporary directory if file_path is not specified.
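The selection logic can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: it treats set_na as the maximum fraction of missing values tolerated in pres_group, uses an invented toy table, and the helper name present_only_in is hypothetical.

```r
# Toy intensity table with "Group_UniqueSampleID" column names.
# NA marks a protein that was not detected in that sample.
df <- data.frame(
  H_1 = c(NA, 20, NA, 18),
  H_2 = c(NA, 21, NA, NA),
  H_3 = c(NA, 19, 22, 17),
  L_1 = c(25, 24, 23, NA),
  L_2 = c(26, NA, 24, NA),
  L_3 = c(NA, 23, 25, NA),
  row.names = c("protA", "protB", "protC", "protD")
)

# Hypothetical helper, not part of promor.
present_only_in <- function(df, abs_group, pres_group, set_na = 0.34) {
  # The group is the part of the column name before the first underscore.
  group <- sub("_.*$", "", colnames(df))
  abs_cols <- df[, group == abs_group, drop = FALSE]
  pres_cols <- df[, group == pres_group, drop = FALSE]
  # Completely missing in abs_group ...
  absent <- rowMeans(is.na(abs_cols)) == 1
  # ... and at most a fraction set_na missing in pres_group (assumed rule).
  present <- rowMeans(is.na(pres_cols)) <= set_na
  rownames(df)[absent & present]
}

present_only_in(df, abs_group = "H", pres_group = "L")
```

In this toy table only protA qualifies: it has no intensities in group H and is missing in just one of three L samples.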
A list of majority protein IDs.
Chathurani Ranathunge
# Generate a raw_df object with default settings. No technical replicates. raw_df <- create_df( prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt", exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt" ) ## Find the proteins only expressed in group L, but absent in group H. onegroup_only(raw_df, abs_group = "H", pres_group = "L")
This function generates plots to visualize model performance.
performance_plot( model_list, type = "box", text_size = 10, palette = "viridis", save = FALSE, file_path = NULL, file_name = "Performance_plot", file_type = "pdf", plot_width = 7, plot_height = 7, dpi = 80 )
model_list |
A |
type |
Type of plot to generate. Choices are "box" or "dot". Default is "box". |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "Performance_plot". |
file_type |
File type to save the plot. Default is "pdf". |
plot_width |
Width of the plot. Default is 7. |
plot_height |
Height of the plot. Default is 7. |
dpi |
Plot resolution. Default is 80. |
performance_plot uses resampling results from the models included in model_list to generate plots showing model performance.
The default metrics used for classification-based models are "Accuracy" and "Kappa". These metrics can be changed by providing additional arguments to the train_models function. See train and trainControl for more information.
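For readers unfamiliar with the two default metrics, the following base R sketch computes accuracy and Cohen's kappa from one set of predictions. The observed and predicted labels are invented, and the sketch covers a single resample only; the plots summarize these values across many resamples.

```r
# Invented observed and predicted classes for one held-out resample.
obs <- factor(c("S", "S", "S", "N", "N", "N", "N", "S", "N", "N"),
              levels = c("N", "S"))
pred <- factor(c("S", "S", "N", "N", "N", "S", "N", "S", "N", "N"),
               levels = c("N", "S"))

# Confusion matrix: rows are predictions, columns are observations.
cm <- table(Predicted = pred, Observed = obs)
n <- sum(cm)

# Accuracy is the proportion of correct predictions.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa rescales accuracy so that chance-level agreement maps to 0.
cohen_kappa <- (accuracy - expected) / (1 - expected)

c(Accuracy = accuracy, Kappa = cohen_kappa)
```

Kappa is lower than accuracy here because part of the observed agreement would be expected from the class frequencies alone.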
A ggplot2 object.
Chathurani Ranathunge
train_models
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets covid_split_df <- split_data(covid_model_df) ## Fit models based on the default list of machine learning (ML) algorithms covid_model_list <- train_models(covid_split_df) ## Generate box plots to visualize performance of different ML algorithms performance_plot(covid_model_list) ## Generate dot plots performance_plot(covid_model_list, type = "dot") ## Change color palette performance_plot(covid_model_list, type = "dot", palette = "inferno")
This function pre-processes protein intensity data from the top differentially expressed proteins identified with find_dep for modeling.
pre_process( fit_df, norm_df, sig = "adjP", sig_cutoff = 0.05, fc = 1, n_top = 20, find_highcorr = TRUE, corr_cutoff = 0.9, save_corrmatrix = FALSE, file_path = NULL, rem_highcorr = TRUE )
fit_df |
A |
norm_df |
The |
sig |
Criteria to denote significance in differential expression.
Choices are |
sig_cutoff |
Cutoff value for p-values and adjusted p-values in differential expression. Default is 0.05. |
fc |
Minimum absolute log-fold change to use as threshold for differential expression. Default is 1. |
n_top |
The number of top hits from |
find_highcorr |
Logical. If |
corr_cutoff |
A numeric value specifying the correlation cutoff. Default is 0.9. |
save_corrmatrix |
Logical. If |
file_path |
A string containing the directory path to save the file. |
rem_highcorr |
Logical. If |
This function creates a data frame that contains protein intensities for a user-specified number of top differentially expressed proteins.
Using find_highcorr = TRUE, highly correlated proteins can be identified, and they can be removed with rem_highcorr = TRUE.
Note: Most models benefit from reduced correlation between proteins (predictors or features). We therefore recommend removing highly correlated proteins at this stage to reduce pairwise correlation.
If no or few proteins meet the significance threshold for differential expression, you may adjust sig, fc, and/or sig_cutoff accordingly to obtain more proteins for modeling.
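A correlation filter of this kind can be sketched in base R. The approach below is a simple illustration, not necessarily the rule pre_process applies: it drops the second member of every pair whose absolute correlation exceeds the cutoff. The simulated intensities and protein names are made up.

```r
set.seed(1)
n <- 50
base <- rnorm(n)

# Simulated intensities: protB is nearly a copy of protA.
intens <- data.frame(
  protA = base,
  protB = base + rnorm(n, sd = 0.05),
  protC = rnorm(n),
  protD = rnorm(n)
)

corr_cutoff <- 0.9

# Absolute pairwise correlations between proteins (columns).
cmat <- abs(cor(intens))

# Keep each pair once and ignore the diagonal.
cmat[lower.tri(cmat, diag = TRUE)] <- 0

# Pairs above the cutoff, as row and column positions.
high_pairs <- which(cmat > corr_cutoff, arr.ind = TRUE)

# Drop the second member of each highly correlated pair.
drop <- unique(colnames(cmat)[high_pairs[, "col"]])
reduced <- intens[, !(colnames(intens) %in% drop), drop = FALSE]

drop
colnames(reduced)
```

Which member of a correlated pair to keep is a design choice; more careful filters compare each candidate's average correlation with all other features before deciding.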
A model_df object, which is a data frame of protein intensities with proteins indicated by columns.
Chathurani Ranathunge
find_dep, normalize_data
## Create a model_df object with default settings. covid_model_df1 <- pre_process(fit_df = covid_fit_df, norm_df = covid_norm_df) ## Change the correlation cutoff. covid_model_df2 <- pre_process(covid_fit_df, covid_norm_df, corr_cutoff = 0.95) ## Change the significance criteria to include more proteins covid_model_df3 <- pre_process(covid_fit_df, covid_norm_df, sig = "P") ## Change the number of top differentially expressed proteins to include covid_model_df4 <- pre_process(covid_fit_df, covid_norm_df, sig = "P", n_top = 24)
This function removes user-specified proteins from a model_df object.
rem_feature(model_df, rem_protein)
model_df |
A |
rem_protein |
Name of the protein to remove. |
After visualizing protein intensity variation among conditions with feature_plot or after assessing the importance of each protein in models using varimp_plot, you can choose to remove specific proteins (features) from the data frame.
For example, you can choose to remove a protein from the model_df object if the protein does not show distinct patterns of variation among conditions. Such a protein may show mostly overlapping distributions in the feature plots.
Another instance would be removing a protein that is very low in variable importance in the models built using train_models. You can visualize variable importance using varimp_plot.
A model_df object.
Chathurani Ranathunge
covid_model_df <- pre_process(fit_df = covid_fit_df, norm_df = covid_norm_df) ## Remove sp|P22352|GPX3_HUMAN protein from the model_df object covid_model_df1 <- rem_feature(covid_model_df, rem_protein = "sp|P22352|GPX3_HUMAN")
This function removes user-specified samples from the data frame.
rem_sample(raw_df, rem)
raw_df |
A |
rem |
Name of the sample to remove. |
rem_sample assumes that sample names follow the "Group_UniqueSampleID_TechnicalReplicate" notation. (Use head(raw_df) to see the structure of the raw_df object.)
If all the technical replicates representing a sample need to be removed, provide "Group_UniqueSampleID" as rem.
If a specific technical replicate needs to be removed, for example because it shows weak correlation with other technical replicates, provide "Group_UniqueSampleID_TechnicalReplicate" as rem.
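The two ways of specifying rem can be illustrated with a small name-matching sketch in base R. How rem_sample matches names internally is not stated here, so the rule below (exact name, or name followed by an underscore and a replicate label) is an assumption, and matches_sample is a hypothetical helper.

```r
# Invented sample names in Group_UniqueSampleID_TechnicalReplicate form.
sample_names <- c("WT_4_1", "WT_4_2", "WT_40_1", "KO_1_1", "KO_1_2")

# Hypothetical helper, not part of promor.
matches_sample <- function(names, rem) {
  # Match the exact name, or the name followed by "_<replicate>".
  # Requiring the underscore keeps "WT_4" from also matching "WT_40_1".
  names == rem | startsWith(names, paste0(rem, "_"))
}

# Remove all technical replicates of sample WT_4.
keep_all <- sample_names[!matches_sample(sample_names, "WT_4")]

# Remove only technical replicate 2 of sample WT_4.
keep_one <- sample_names[!matches_sample(sample_names, "WT_4_2")]

keep_all
keep_one
```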
A raw_df object.
Chathurani Ranathunge
## Use a data set containing technical replicates to create a raw_df object raw_df <- create_df( prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt", exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt", tech_reps = TRUE ) # Check the first few rows of the raw_df object head(raw_df) ## Remove all technical replicates of "WT_4" raw_df1 <- rem_sample(raw_df, "WT_4") ## Remove only technical replicate number 2 of "WT_4" raw_df2 <- rem_sample(raw_df, "WT_4_2")
This function generates Receiver Operating Characteristic (ROC) curves to evaluate models.
roc_plot( probability_list, split_df, ..., multiple_plots = TRUE, text_size = 10, palette = "viridis", save = FALSE, file_path = NULL, file_name = "ROC_plot", file_type = "pdf", plot_width = 7, plot_height = 7, dpi = 80 )
probability_list |
A |
split_df |
A |
... |
Additional arguments to be passed on to
|
multiple_plots |
Logical. If |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "ROC_plot". |
file_type |
File type to save the plot. Default is "pdf". |
plot_width |
Width of the plot. Default is 7. |
plot_height |
Height of the plot. Default is 7. |
dpi |
Plot resolution. Default is 80. |
roc_plot first uses probabilities generated during test_models to build a ROC object.
Next, relevant information is extracted from the ROC object to plot the ROC curves.
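What a ROC object holds can be shown with a hand-rolled base R sketch. It is not the routine roc_plot relies on; the labels and probabilities are invented, roc_points is a hypothetical helper, and the sketch assumes there are no tied probabilities.

```r
# Invented test-set labels (1 = positive class) and predicted
# probabilities of the positive class.
truth <- c(1, 1, 0, 1, 0, 0, 1, 0)
prob <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Hypothetical helper, not part of promor.
roc_points <- function(truth, prob) {
  # Lowering the threshold admits samples in order of decreasing probability.
  ord <- order(prob, decreasing = TRUE)
  truth <- truth[ord]
  # Cumulative true and false positive rates, starting from (0, 0).
  tpr <- c(0, cumsum(truth == 1) / sum(truth == 1))
  fpr <- c(0, cumsum(truth == 0) / sum(truth == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

roc <- roc_points(truth, prob)

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
auc
```

Plotting tpr against fpr from this data frame gives the step-shaped curve; one such curve per algorithm is what the function draws.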
A ggplot2 object.
Chathurani Ranathunge
test_models
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets covid_split_df <- split_data(covid_model_df) ## Fit models using the default list of machine learning (ML) algorithms covid_model_list <- train_models(covid_split_df) # Test a list of models on a test data set and output class probabilities, covid_prob_list <- test_models(covid_model_list, covid_split_df, type = "prob") ## Plot ROC curves separately for each ML algorithm roc_plot(covid_prob_list, covid_split_df) ## Plot all ROC curves in one plot roc_plot(covid_prob_list, covid_split_df, multiple_plots = FALSE) ## Change color palette roc_plot(covid_prob_list, covid_split_df, palette = "plasma")
This function can be used to create balanced splits of the protein intensity data in a model_df object to create training and test data sets.
split_data(model_df, train_size = 0.8, seed = NULL)
model_df |
A |
train_size |
The size of the training data set as a proportion of the complete data set. Default is 0.8. |
seed |
Numerical. Random number seed. Default is NULL. |
This function splits the model_df object into training and test data sets using random sampling while preserving the original class distribution of the data. Make sure to fix the random number seed with seed for reproducibility.
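A split that preserves the class distribution can be sketched in base R by sampling within each class. This is a minimal illustration, not the package's implementation; the toy data, the Condition column name, and the helper stratified_split are assumptions made for the example.

```r
# Toy model data: a class column plus one feature column.
toy <- data.frame(
  Condition = factor(rep(c("Severe", "Non.Severe"), times = c(10, 20))),
  prot1 = seq_len(30)
)

# Hypothetical helper, not part of promor.
stratified_split <- function(df, class_col, train_size = 0.8, seed = NULL) {
  # Fixing the seed makes the random draw repeatable.
  if (!is.null(seed)) set.seed(seed)
  # Sample row indices separately within each class.
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_size * length(idx)))]
  }))
  list(training = df[train_idx, ], test = df[-train_idx, ])
}

parts <- stratified_split(toy, "Condition", train_size = 0.8, seed = 8314)

# Both parts keep the original 1:2 ratio of Severe to Non.Severe.
table(parts$training$Condition)
table(parts$test$Condition)
```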
A list of data frames.
Chathurani Ranathunge
pre_process
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets using default settings covid_split_df1 <- split_data(covid_model_df, seed = 8314) ## Split the data frame into training and test data sets with 70% of the ## data in training and 30% in test data sets covid_split_df2 <- split_data(covid_model_df, train_size = 0.7, seed = 8314) ## Access training data set covid_split_df1$training ## Access test data set covid_split_df1$test
This function can be used to predict test data using models generated by different machine learning algorithms.
test_models( model_list, split_df, type = "prob", save_confusionmatrix = FALSE, file_path = NULL, ... )
model_list |
A |
split_df |
A |
type |
Type of output. Set |
save_confusionmatrix |
Logical. If |
file_path |
A string containing the directory path to save the file. |
... |
Additional arguments to be passed on to
|
test_models uses models obtained from train_models to predict a given test data set.
Setting type = "raw" is required to obtain confusion matrices.
Setting type = "prob" (default) will output a list of probabilities that can be used to generate ROC curves using roc_plot.
probability_list: If type = "prob", a list of data frames containing class probabilities for each method in the model_list will be returned.
prediction_list: If type = "raw", a list of factors containing class predictions for each method will be returned.
Chathurani Ranathunge
split_df
train_models
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets covid_split_df <- split_data(covid_model_df) ## Fit models using the default list of machine learning (ML) algorithms covid_model_list <- train_models(covid_split_df) # Test a list of models on a test data set and output class probabilities, covid_prob_list <- test_models(model_list = covid_model_list, split_df = covid_split_df) ## Not run: # Save confusion matrices in the working directory and output class predictions covid_pred_list <- test_models( model_list = covid_model_list, split_df = covid_split_df, type = "raw", save_confusionmatrix = TRUE, file_path = "." ) ## End(Not run)
This function can be used to train models on protein intensity data using different machine learning algorithms.
train_models( split_df, resample_method = "repeatedcv", resample_iterations = 10, num_repeats = 3, algorithm_list, seed = NULL, ... )
split_df |
A |
resample_method |
The resampling method to use. Default is "repeatedcv". |
resample_iterations |
Number of resampling iterations. Default is 10. |
num_repeats |
The number of complete sets of folds to compute (For
|
algorithm_list |
A list of classification or regression algorithms to
use.
A full list of machine learning algorithms available through
the |
seed |
Numerical. Random number seed. Default is NULL. |
... |
Additional arguments to be passed on to
|
train_models can be used to define the control parameters to be used in training models, calculate resampling-based performance measures for models based on a given set of machine-learning algorithms, and output the best model for each algorithm.
If algorithm_list is not provided, a default list of five classification-based machine-learning algorithms will be used for building and training models. Default algorithm_list: "svmRadial", "rf", "glm", "xgbLinear", and "naive_bayes".
Note: Models that fail to build are removed from the output.
Make sure to fix the random number seed with seed for reproducibility.
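To make the default resampling settings concrete, the base R sketch below builds the hold-out sets for repeated k-fold cross-validation with 10 folds and 3 repeats. It only illustrates the resampling scheme and is not how the package or caret constructs folds; in particular, it does not stratify by class, and make_repeated_folds is a hypothetical helper. That resample_iterations corresponds to the number of folds and num_repeats to the number of repeats is also an assumption.

```r
# Hypothetical helper, not part of promor or caret.
make_repeated_folds <- function(n, k = 10, repeats = 3, seed = NULL) {
  # Fixing the seed makes the fold assignment repeatable.
  if (!is.null(seed)) set.seed(seed)
  folds <- list()
  for (r in seq_len(repeats)) {
    # Shuffle fold labels so each repeat partitions the rows differently.
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # Rows held out for assessment in this fold of this repeat.
      folds[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(fold_id == f)
    }
  }
  folds
}

folds <- make_repeated_folds(n = 40, k = 10, repeats = 3, seed = 351)

# 10 folds x 3 repeats = 30 resamples, each holding out 4 of 40 rows.
length(folds)
lengths(folds)[1:3]
```

Each model is fitted once per resample on the rows not held out, so the performance summaries are based on 30 accuracy and kappa estimates per algorithm under these settings.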
A list of class train for each machine-learning algorithm. See train for more information on accessing different elements of this list.
Chathurani Ranathunge
Kuhn, Max. "Building predictive models in R using the caret package." Journal of statistical software 28 (2008): 1-26.
pre_process
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets covid_split_df <- split_data(covid_model_df, seed = 8314) ## Fit models based on the default list of machine learning (ML) algorithms covid_model_list1 <- train_models(split_df = covid_split_df, seed = 351) ## Fit models using a user-specified list of ML algorithms. covid_model_list2 <- train_models( covid_split_df, algorithm_list = c("svmRadial", "glmboost"), seed = 351 ) ## Change resampling method and resampling iterations. covid_model_list3 <- train_models( covid_split_df, resample_method = "cv", resample_iterations = 50, seed = 351 )
This function visualizes variable importance in models.
varimp_plot( model_list, ..., type = "lollipop", text_size = 10, palette = "viridis", n_row, n_col, save = FALSE, file_path = NULL, file_name = "VarImp_plot", file_type = "pdf", dpi = 80, plot_width = 7, plot_height = 7 )
model_list |
A |
... |
Additional arguments to be passed on to
|
type |
Type of plot to generate. Choices are "bar" or "lollipop". Default is "lollipop". |
text_size |
Text size for plot labels, axis labels etc. Default is 10. |
palette |
Viridis color palette option for plots. Default is "viridis". |
n_row |
Number of rows to print the plots. |
n_col |
Number of columns to print the plots. |
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "VarImp_plot". |
file_type |
File type to save the plot. Default is "pdf". |
dpi |
Plot resolution. Default is 80. |
plot_width |
Width of the plot. Default is 7. |
plot_height |
Height of the plot. Default is 7. |
varimp_plot produces a list of plots showing variable importance measures calculated from models generated with different machine-learning algorithms.
Note: Variables are ordered by variable importance in descending order, and by default, importance values are scaled to range from 0 to 100. This can be changed by specifying scale = FALSE. See varImp for more information.
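The effect of scaling can be seen in a short base R sketch. It assumes a min-max rescaling, in which the least important variable maps to 0 and the most important to 100; the raw importance values are invented, and the exact formula used for a given model type should be checked in the varImp documentation.

```r
# Invented raw importance scores for four proteins.
raw_imp <- c(protA = 4.2, protB = 1.0, protC = 2.6, protD = 0.2)

# Min-max rescaling to the 0-100 range (assumed scaling rule).
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100

# Order variables from most to least important, as in the plots.
ordered_imp <- sort(scaled_imp, decreasing = TRUE)
ordered_imp
```

Scaling changes the axis values but not the ranking of the proteins, so the plot order is the same with scale = FALSE.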
A list of ggplot2 objects.
Chathurani Ranathunge
train_models, rem_feature
## Create a model_df object covid_model_df <- pre_process(covid_fit_df, covid_norm_df) ## Split the data frame into training and test data sets covid_split_df <- split_data(covid_model_df) ## Fit models based on the default list of machine learning (ML) algorithms covid_model_list <- train_models(covid_split_df) ## Variable importance - lollipop plots varimp_plot(covid_model_list) ## Bar plots varimp_plot(covid_model_list, type = "bar") ## Do not scale variable importance values varimp_plot(covid_model_list, scale = FALSE) ## Change color palette varimp_plot(covid_model_list, palette = "magma")
This function generates volcano plots to visualize differentially expressed proteins between groups.
volcano_plot( fit_df, adj_method = "BH", sig = "adjP", cutoff = 0.05, lfc = 1, line_fc = TRUE, line_p = TRUE, palette = "viridis", text_size = 10, label_top = FALSE, n_top = 10, save = FALSE, file_path = NULL, file_name = "Volcano_plot", file_type = "pdf", plot_height = 7, plot_width = 7, dpi = 80 )
fit_df |
A |
adj_method |
Method used for adjusting the p-values for multiple testing. Default is "BH". |
sig |
Criteria to denote significance. Choices are |
cutoff |
Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc |
Minimum absolute log2-fold change to use as threshold for differential expression. |
line_fc |
Logical. If |
line_p |
Logical. If |
palette |
Viridis color palette option for plots. Default is "viridis". |
text_size |
Text size for axis text, labels etc. |
label_top |
Logical. If |
n_top |
The number of top hits to label with protein name when
|
save |
Logical. If |
file_path |
A string containing the directory path to save the file. |
file_name |
File name to save the plot. Default is "Volcano_plot". |
file_type |
File type to save the plot. Default is "pdf". |
plot_height |
Height of the plot. Default is 7. |
plot_width |
Width of the plot. Default is 7. |
dpi |
Plot resolution. Default is 80. |
Volcano plots show log2-fold change on the x-axis and, based on the significance criteria chosen, either -log10(p-value) or -log10(adjusted p-value) on the y-axis.
volcano_plot requires a fit_df object from performing differential expression analysis with find_dep.
The user can choose the criteria that denote significance.
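The quantities behind a volcano plot can be computed in base R as sketched below. The results table and its column names are invented for the example, and the exact comparison rules (strictly below the cutoff, absolute log2-fold change at or above lfc) are assumptions rather than documented behavior.

```r
# Invented differential expression results.
res <- data.frame(
  protein = c("p1", "p2", "p3", "p4"),
  logFC = c(2.1, -1.6, 0.4, -0.2),
  P.Value = c(0.0004, 0.003, 0.02, 0.6)
)

# Adjust p-values for multiple testing with the Benjamini-Hochberg method.
res$adj.P.Val <- p.adjust(res$P.Value, method = "BH")

cutoff <- 0.05
lfc <- 1

# y-axis value when significance is based on adjusted p-values.
res$neg_log10 <- -log10(res$adj.P.Val)

# A protein is flagged when it passes both thresholds (assumed rule).
res$significant <- res$adj.P.Val < cutoff & abs(res$logFC) >= lfc
res
```

Here p3 passes the p-value cutoff but not the fold-change threshold, so it would fall between the vertical threshold lines and remain unhighlighted.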
A ggplot2 plot object.
Chathurani Ranathunge
## Create a volcano plot with default settings. volcano_plot(ecoli_fit_df) ## Change significance criteria and cutoff volcano_plot(ecoli_fit_df, cutoff = 0.1, sig = "P") ## Label top 30 differentially expressed proteins and ## change the color palette of the plot volcano_plot(ecoli_fit_df, label_top = TRUE, n_top = 30, palette = "mako")