
gamlss-dev/gamlss.prepdata


gamlss.prepdata: Preparing Data for Distributional Regression

Overview

The purpose of this package is to provide functions that facilitate the creation of data frames suitable for statistical modeling, especially for distributional regression models fitted with the gamlss and gamlss2 packages.

A lot of information can be gained from a preliminary data analysis. One could seek information about the variables in the data set themselves, about outliers, about associations between variables, about the type of relationship between the response and the explanatory variables (linear or nonlinear), and about possible interactions between the explanatory variables. In addition, at this stage one could decide on appropriate partitions of the data to maximize statistical inference. While the answers to those questions are not necessarily final at this pre-modeling stage, they could help at the next stage, the fitting process. All functions in the package can be used before the functions gamlss() or gamlss2() are used for modeling.

Installation

The package is not yet on CRAN but can be installed from R-universe.

install.packages("gamlss.prepdata", repos = "https://gamlss-dev.R-universe.dev")

Functions

The functions for manipulating variables are shown below.

Functions Usage
data_dim() Dimensions & % of omitted observations
data_names() Names of the data
data_distinct() Distinct values in variables
data_which() NAs in variables
data_str() The class of variables etc.
data_omit() Omit all the NAs
data_char2fac() From characters to factors
data_few2fac() From few distinct obs. to factors
data_int2num() From integers to numeric
data_rm() Remove variables
data_rm1val() Remove factors with one level
data_rename() Rename variables
data_renamove() Remove variables
data_select() Select variables
data_exclude_class() Exclude a specified class
data_only_continous() Include only continuous variables
data_rmNAvars() Remove variables with NA values
data_fac2num() Make factor numeric
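
To give a feel for what these preparation steps involve, here is a minimal base R sketch on a made-up data frame. It does not call gamlss.prepdata and is not the package's implementation; the data frame `toy` and every object name below are assumptions introduced only for illustration.

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(
  y     = c(1.2, 3.4, NA, 5.6, 7.8),
  group = c("a", "b", "a", "b", "a"),
  const = factor(rep("same", 5)),
  count = c(1L, 2L, 3L, 4L, 5L),
  stringsAsFactors = FALSE
)

# Percentage of rows that would be dropped by omitting NAs
pct_omitted <- 100 * (1 - nrow(na.omit(toy)) / nrow(toy))

# Number of NAs in each variable
na_counts <- colSums(is.na(toy))

# Characters to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Integers (but not factors) to numeric
is_int <- vapply(toy, function(v) is.integer(v) && !is.factor(v), logical(1))
toy[is_int] <- lapply(toy[is_int], as.numeric)

# Remove factors that have a single level
one_level <- vapply(toy, function(v) is.factor(v) && nlevels(v) < 2, logical(1))
toy <- toy[!one_level]

# Omit all rows that contain NAs
toy_complete <- na.omit(toy)
str(toy_complete)
```

The package functions listed above are meant to cover steps of this kind; check their help pages for the actual arguments and return values.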

The functions for graphics are shown below.

Functions Usage
data_plot() Univariate plots of all variables
data_xyplot() Pairwise plots of the response against all others
data_bucket() Bucket plots of all variables
data_cor() Pairwise correlations
data_pcor() Pairwise partial correlations
data_void() Pairwise % of empty spaces
data_inter() Pairwise interactions
data_response() Response variable plots
data_zscores() Univariate plots using z-scores
data_outliers() Univariate detection of outliers
data_leverage() Detection of outliers in the x's space
data_scale() Univariate scaling of the x's
data_trans_plot() Checking for univariate transformations in the x's
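
As an illustration of the idea behind z-score based outlier screening, the sketch below uses classical standardization in base R. This is an assumption made for the example: how data_zscores() and data_outliers() actually compute their scores is not described here, and they may well use a different (for example distribution-based) definition. The simulated vector `x` and the cut-off of 3 are likewise hypothetical.

```r
# Hypothetical data: 50 well-behaved values plus one extreme value
set.seed(1)
x <- c(rnorm(50, mean = 10, sd = 1), 25)

# Classical z-scores: centre by the mean, scale by the standard deviation
z <- (x - mean(x)) / sd(x)

# Flag observations more than 3 standard deviations from the mean
# (the cut-off of 3 is a convention chosen for this sketch)
outlier_idx <- which(abs(z) > 3)
outlier_idx
```

Classical z-scores are only a rough guide for skewed variables, which is one reason to inspect such variables graphically as well.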

Graphics

Here are some examples of the graphical functions.

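# Load the modeling packages, the plotting helpers, and gamlss.prepdata
library("gamlss")
library("gamlss2")
library("ggplot2")
library("gamlss.ggplots")
library("gamlss.prepdata")
library("dplyr")
packageVersion("gamlss.prepdata")
## [1] '0.1.5'
# Remove the 2nd and 9th columns of rent99; seven variables remain
da <- data_rm(rent99, c(2, 9))
dim(rent99)
## [1] 3082    9
dim(da)
## [1] 3082    7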

data_plot()

The function data_plot() plots each variable in the data individually. It plots:

  • continuous variables as histograms with a density plot superimposed (see the plots for rent and yearc below). Alternatively, dot plots can be requested (see the example in Section 1.6).

  • integers as needle plots (see the plot for area below).

  • categorical variables as bar plots (see the plots for location, bath, kitchen and cheating below).

The message "100 % of data are saved" shown below is produced by the function data_cut(), which is used before any ggplot2 plot.

da |> data_plot()
##  100 % of data are saved, 
## that is, 3082 observations.

The function can save the ggplot2 figures.
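
The rule that data_plot() follows when choosing a plot for each variable can be sketched in base R as a simple dispatch on the variable's class. The helper `plot_type()` and the data frame `toy` (with made-up values) are hypothetical names used only for this example; this is not the package's code.

```r
# Hypothetical helper: decide which kind of univariate plot suits a variable
plot_type <- function(v) {
  if (is.factor(v) || is.character(v)) {
    "bar plot"                  # categorical variables
  } else if (is.integer(v)) {
    "needle plot"               # integer variables
  } else if (is.numeric(v)) {
    "histogram with density"    # continuous variables
  } else {
    "unsupported"
  }
}

# Made-up data with one variable of each kind
toy <- data.frame(
  rent     = c(109.9, 243.3, 261.6),
  area     = c(26L, 28L, 30L),
  location = factor(c(2, 2, 1))
)

# One plot type per column
types <- vapply(toy, plot_type, character(1))
types
```

Factors are tested first because they are stored as integer codes internally, so the order of the checks matters.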

data_xyplot()

The function data_xyplot() plots the response variable against each of the explanatory variables. It plots continuous against continuous variables as scatter plots and continuous against categorical variables as box plots.

Warning

At the moment there is no provision for categorical response variables.

da |> data_xyplot(response = rent)
##  100 % of data are saved, 
## that is, 3082 observations.

The ggplot2 figures are saved in the output of the function.

data_bucket()

The function data_bucket() can be used to identify high skewness and kurtosis in the continuous variables in the data. Note that a continuous variable that looks normally distributed should lie in the center of the figure.

data_bucket(da, response = rent)
##  100 % of data are saved, 
## that is, 3082 observations. 
##     rent     area    yearc location     bath  kitchen cheating 
##     2723      132       68        3        2        2        2
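
To see why a normal-looking variable ends up in the center of such a figure, it can help to compute skewness and excess kurtosis directly. The base R sketch below assumes simple moment-based measures; the bucket plot itself may use transformed or otherwise different versions, and the helper names and simulated data are hypothetical.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(123)
x_norm <- rnorm(5000)  # simulated normal variable
x_skew <- rexp(5000)   # simulated right-skewed variable

# Both measures are close to zero for the normal sample ...
c(skew = moment_skewness(x_norm), kurt = moment_excess_kurtosis(x_norm))
# ... and clearly positive for the skewed sample
c(skew = moment_skewness(x_skew), kurt = moment_excess_kurtosis(x_skew))
```

With both measures near zero, the normal sample sits at the origin of a skewness-kurtosis plane, while the exponential sample lies well away from it.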
