concept lognormal distribution in category R

This is an excerpt from Manning's book Practical Data Science with R, Second Edition.
Recall from the discussion of the lognormal distribution in section 4.2 that it’s often useful to log transform monetary quantities. The log transform is also compatible with our original task of predicting incomes with a relative error (meaning large errors count more against small incomes). The glm() methods of section 7.2 can be used to avoid the log transform and predict in such a way as to minimize squared errors (so being off by $50,000 would be considered the same error for both large and small incomes).
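To make the contrast between relative and absolute error concrete, here is a minimal sketch. The incomes and the variable names (actual, predicted, abs_err, log_err) are hypothetical and chosen only for illustration; they do not come from the book's data.

```r
# Hypothetical incomes, each over-predicted by the same $50,000
actual <- c(small = 50000, large = 500000)
predicted <- actual + 50000

# Absolute error: identical for the small and the large income
abs_err <- predicted - actual

# Error on the log scale: equals log(predicted / actual), a relative measure,
# so the same dollar miss counts far more against the small income
log_err <- log(predicted) - log(actual)

print(abs_err)
print(log_err)
```

On the log scale the $50,000 miss on the small income is a factor-of-two error, while on the large income it is only a 10% error.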
The lognormal distribution is the distribution of a random variable X whose natural log log(X) is normally distributed. The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modeled as a lognormal distribution. A lognormal distribution is defined over all non-negative real numbers; as shown in figure B.4 (top), it’s asymmetric, with a long tail out toward positive infinity. The distribution of log(X) (figure B.4, bottom) is a normal distribution centered at mean(log(X)). For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution toward the mean value is due to a small population of highest-valued data points.
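The gap between the mean and the median can be checked against the standard closed-form expressions for a lognormal distribution: the median is exp(meanlog) and the mean is exp(meanlog + sdlog^2 / 2). The sketch below is illustrative only; the seed, the sample size, and the choice of the top 10% as the "highest-valued" group are assumptions, not taken from the book.

```r
meanlog <- 0
sdlog <- 1

# Closed-form median and mean of a lognormal distribution
theoretical_median <- exp(meanlog)
theoretical_mean <- exp(meanlog + sdlog^2 / 2)

# Simulate a large sample (arbitrary seed, for reproducibility only)
set.seed(2024)
x <- rlnorm(100000, meanlog = meanlog, sdlog = sdlog)

# Fraction of the sample total contributed by the top 10% of values
top_share <- sum(x[x >= quantile(x, 0.9)]) / sum(x)

print(c(median = theoretical_median, mean = theoretical_mean))
print(c(sample_median = median(x), sample_mean = mean(x)))
print(top_share)
```

With the default parameters the theoretical median is 1 and the theoretical mean is exp(0.5), about 1.65, and the top tenth of the sample accounts for a share of the total well above 10%.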
Let’s look at the functions for working with the lognormal distribution in R (see also section B.5.3). We’ll start with dlnorm() and rlnorm():
dlnorm(x, meanlog = m, sdlog = s) is the probability density function (PDF). It returns the probability density at the value x for a lognormal distribution X such that mean(log(X)) = m and sd(log(X)) = s. By default, meanlog = 0 and sdlog = 1 for all the functions discussed in this section.
rlnorm(n, meanlog = m, sdlog = s) is the random number generation function. It returns n values drawn from a lognormal distribution with mean(log(X)) = m and sd(log(X)) = s.
We can use dlnorm() and rlnorm() to produce figure B.4, shown earlier. The following listing demonstrates some properties of the lognormal distribution.
Listing B.5. Demonstrating some properties of the lognormal distribution
library(ggplot2)

# draw 1001 samples from a lognormal with meanlog 0, sdlog 1
u <- rlnorm(1001)

# the mean of u is higher than the median
mean(u)
# [1] 1.638628
median(u)
# [1] 1.001051

# the mean of log(u) is approx meanlog=0
mean(log(u))
# [1] -0.002942916

# the sd of log(u) is approx sdlog=1
sd(log(u))
# [1] 0.9820357

# generate the lognormal with meanlog = 0, sdlog = 1
x <- seq(from = 0, to = 25, length.out = 500)
f <- dlnorm(x)

# generate a normal with mean = 0, sd = 1
x2 <- seq(from = -5, to = 5, length.out = 500)
f2 <- dnorm(x2)

# make data frames
lnormframe <- data.frame(x = x, y = f)
normframe <- data.frame(x = x2, y = f2)
dframe <- data.frame(u = u)

# plot densityplots with theoretical curves superimposed
p1 <- ggplot(dframe, aes(x = u)) + geom_density() +
  geom_line(data = lnormframe, aes(x = x, y = y), linetype = 2)
p2 <- ggplot(dframe, aes(x = log(u))) + geom_density() +
  geom_line(data = normframe, aes(x = x, y = y), linetype = 2)

# functions to plot multiple plots on one page
library(grid)
nplot <- function(plist) {
  n <- length(plist)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(n, 1)))
  vplayout <- function(x, y) {
    viewport(layout.pos.row = x, layout.pos.col = y)
  }
  for(i in 1:n) {
    print(plist[[i]], vp = vplayout(i, 1))
  }
}

# this is the plot that leads this section.
nplot(list(p1, p2))
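As a complement to the listing, the following sketch relates dlnorm() and rlnorm() to their normal counterparts. It relies on the change-of-variables identity dlnorm(x, m, s) = dnorm(log(x), m, s) / x for x > 0. The parameter values, the evaluation points, and the seed are arbitrary choices made for illustration.

```r
m <- 0.5   # illustrative meanlog
s <- 0.75  # illustrative sdlog
x <- c(0.5, 1, 2, 5)

# Lognormal density computed directly and via the normal density of log(x)
d_direct <- dlnorm(x, meanlog = m, sdlog = s)
d_from_normal <- dnorm(log(x), mean = m, sd = s) / x
print(all.equal(d_direct, d_from_normal))

# The density integrates to 1 over the positive reals
total_area <- integrate(dlnorm, lower = 0, upper = Inf,
                        meanlog = m, sdlog = s)$value
print(total_area)

# A density value is not a probability: it can exceed 1 when sdlog is small
peak <- dlnorm(1, meanlog = 0, sdlog = 0.1)
print(peak)

# Logs of rlnorm() draws have mean near m and sd near s
set.seed(42)  # arbitrary seed, for reproducibility only
draws <- rlnorm(10000, meanlog = m, sdlog = s)
print(c(mean(log(draws)), sd(log(draws))))
```

Because dlnorm() returns a density, its values are meaningful as relative likelihoods or as integrands over an interval, not as probabilities of observing an exact value.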