Dear all,
first, I should say that I have been using R for only two days (so you
can guess how well I know the language!).
Still, I managed to implement some things related to the Long-Tail model [1].
I ran some tests with the data in Table 1 of [1] and plotted Figure 2
of [1]. (The R code and CSV file are at the end of this email.)
Now I am stuck on the nonlinear regression fit of F(x). I get this
error:
"
Error in nls(~F(r, N50, beta, alfa), data = dataset, start = list(N50 = N50, :
  singular gradient
"
And yes, I have looked for a solution (in this mailing list's archives and
on Google), but I could not find a proper one. That is why I am asking the
experts for help! :-)
Any help would be much appreciated.
Cheers, Oscar
[1] http://www.firstmonday.org/issues/issue12_5/kilkki/
PS: R code and CSV file
FILE: "data.R" (data taken from [1] Table 1, columns 1 and 2)
--8=<-------------------
"rank","cum_value"
10, 17396510
32, 31194809
96, 53447300
420, 100379331
1187, 152238166
24234, 432238757
91242, 581332371
294180, 650880870
1242185, 665227287
-->=8-------------------
R CODE:
#
# F(x). The long-tail model
# Reference: http://www.firstmonday.org/issues/issue12_5/kilkki/
# Params:
# x : Rank (either an integer or a list)
# N50 : the number of objects that cover half of the whole volume
# beta: total volume
# alfa: the factor that defines the form of the function
F <- function(x, N50, beta = 1.0, alfa = 0.49)
{
  xx <- as.numeric(x) # as.numeric() prevents overflow
  Fx <- beta / ((N50 / xx)^alfa + 1)
  Fx
}
# Read CSV file (rank, cum_value)
lt <- read.csv(file="data.R", header=TRUE, sep=",")
r <- lt$rank
v <- lt$cum_value
pcnt <- v/v[length(v)] *100 # get cumulative percentage
plot(r, pcnt, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="The long-tail effect", col='blue')
# Set some default values to be used by F(x)...
alfa = 0.49
beta = 1.38
N50 = 30714
# Start using F(x). Results are in 'f' ...
f <- c(0) # oops! is this the best initialization for 'f'?
for (i in 1:24234) f[i] <- F(i, N50, beta, alfa)*100
# Plot some estimated values from F(x) (N50, beta, and alfa values come
# from the paper. See ref. [1])
plot(f, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="Plotting first values of F(x) and some real points")
points(r, pcnt, col="blue") # adding the "real" points
# Create a dataset to be used by nls()
dataset <- data.frame(r, pcnt)
# Verifying that F(x) works fine... (comparing with the "real" values
# contained in the dataset)
dataset
F(10, N50, beta, alfa) * 100
F(32, N50, beta, alfa) * 100
F(96, N50, beta, alfa) * 100
F(420, N50, beta, alfa) * 100
F(1187, N50, beta, alfa) * 100
F(24234, N50, beta, alfa) * 100
F(91242, N50, beta, alfa) * 100
F(294180, N50, beta, alfa) * 100
F(1242185, N50, beta, alfa) * 100
#dataset <- data.frame(pcnt) # which dataset should I use? Should I
# include the ranks in it?
nls( ~ F(r, N50, beta, alfa), data = dataset, start = list(N50=N50,
beta=beta, alfa=alfa), trace = TRUE )
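
One thing that stands out in the nls() call above is that the formula has
nothing on the left of the "~". As far as I understand nls(), a one-sided
formula is fitted as if the response were zero, which could explain the
singular gradient. The following is only a minimal sketch, assuming the
intent is to fit the cumulative percentages (pcnt) against 100 * F(r, ...),
since F() returns a fraction while pcnt is in percent. The names "start"
and "fml" are introduced here for illustration and are not in the original
code. It also shows that F() can be applied to a whole vector of ranks, so
'f' needs neither a loop nor an initialization. Whether the fit converges
from these starting values on these nine points is not something the sketch
guarantees, hence the try().

```r
# Long-tail model from the post: F(x) = beta / ((N50 / x)^alfa + 1)
F <- function(x, N50, beta = 1.0, alfa = 0.49) {
  beta / ((N50 / as.numeric(x))^alfa + 1)
}

# Data copied from the CSV listing in the post (Table 1 of [1])
lt <- data.frame(
  rank      = c(10, 32, 96, 420, 1187, 24234, 91242, 294180, 1242185),
  cum_value = c(17396510, 31194809, 53447300, 100379331, 152238166,
                432238757, 581332371, 650880870, 665227287)
)

# Same dataset as in the post: ranks plus cumulative percentage
dataset <- data.frame(r    = lt$rank,
                      pcnt = lt$cum_value / lt$cum_value[nrow(lt)] * 100)

# Starting values taken from the post (illustrative name 'start')
start <- list(N50 = 30714, beta = 1.38, alfa = 0.49)

# F() works on vectors, so the whole curve is computed in one call
f <- F(1:24234, start$N50, start$beta, start$alfa) * 100

# Two-sided formula: response on the left, model (in percent) on the right
fml <- pcnt ~ 100 * F(r, N50, beta, alfa)

# try() keeps the script running if the fit does not converge
fit <- try(nls(fml, data = dataset, start = start, trace = TRUE),
           silent = TRUE)
if (inherits(fit, "nls")) print(coef(fit))
```

A quick sanity check on the model itself: by definition N50 is the rank
that covers half of the total volume, so F(N50, N50, beta, alfa) should
equal beta / 2 for any alfa.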