thr3ads.net - R help - [R] Long-tail model in R ... anyone? [Jul 2007]

ocelma at iua.upf.edu

2007-Jul-04 17:25 UTC

[R] Long-tail model in R ... anyone?

Dear all,

first, I should tell you that I have been using R for only two days (so
you can guess my knowledge of the language!).

Still, I managed to implement a few things related to the Long-Tail model [1].
I ran some tests with the data in Table 1 of [1] and plotted Figure 2
of [1]. (See the R code and CSV file at the end of this email.)

Now I'm stuck on the nonlinear regression fit of F(x). I get a nice
error:
"
Error in nls(~F(r, N50, beta, alfa), data = dataset, start = list(N50 = N50,  :
        singular gradient
"

And yes, I have looked for a way to solve this (via this mailing list and
some googling), but I could not find a proper solution. That's why
I am asking the experts for help! :-)

So, any help would be much appreciated...

Cheers, Oscar
[1] http://www.firstmonday.org/issues/issue12_5/kilkki/

PS: R code and CSV file

FILE: "data.R" (data taken from [1] Table 1, columns 1 and 2)
--8=<-------------------
"rank","cum_value"
10,     17396510
32,     31194809
96,     53447300
420,    100379331
1187,   152238166
24234,  432238757
91242,  581332371
294180, 650880870
1242185,665227287
-->=8-------------------

R CODE:

#
# F(x). The long-tail model
# Reference: http://www.firstmonday.org/issues/issue12_5/kilkki/
# Params:
#       x   :   Rank (either an integer or a list)
#       N50 :   the number of objects that cover half of the whole volume
#       beta:   total volume
#       alfa:   the factor that defines the form of the function
F <- function (x, N50, beta=1.0, alfa=0.49)
{
        xx <- as.numeric(x) # as.numeric() prevents overflow
        Fx = beta / ( (N50/xx)^alfa + 1 )
        Fx
}

# Read CSV file (rank, cum_value)
lt <- read.csv(file="data.R",head=TRUE,sep=",")

r <- lt$rank
v <- lt$cum_value
pcnt <- v/v[length(v)] *100 # get cumulative percentage
plot(r, pcnt, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="The long-tail effect", col='blue')

# Set some default values to be used by F(x)...
alfa = 0.49
beta = 1.38
N50 = 30714

# Start using F(x). Results are in 'f' ...
f <- c(0) # oops! is this the best initialization for 'f'?
for (i in 1:24234) f[i] <- F(i, N50, beta, alfa)*100

# Plot some estimated values from F(x) (N50, beta, and alfa values come
# from the paper. See ref. [1])
plot(f, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="Plotting first values of F(x) and some real points")
points(r, pcnt, col="blue") # adding the "real" points

# Create a dataset to be used by nls()
dataset <- data.frame(r, pcnt)

# Verifying that F(x) works fine... (comparing with the "real" values
# contained in the dataset)

dataset
F(10, N50, beta, alfa) * 100
F(32, N50, beta, alfa) * 100
F(96, N50, beta, alfa) * 100
F(420, N50, beta, alfa) * 100
F(1187, N50, beta, alfa) * 100
F(24234, N50, beta, alfa) * 100
F(91242, N50, beta, alfa) * 100
F(294180, N50, beta, alfa) * 100
F(1242185, N50, beta, alfa) * 100

#dataset <- data.frame(pcnt) # which dataset should I use? Should I
# include the ranks in it?
nls( ~ F(r, N50, beta, alfa), data = dataset, start = list(N50=N50,
beta=beta, alfa=alfa), trace = TRUE )

Dirk Eddelbuettel

2007-Jul-04 19:15 UTC

[R] Long-tail model in R ... anyone?

I think you simply had your nls() syntax wrong.  Works here:


## first a neat trick to read the data from embedded text
> fmdata <- read.csv(textConnection("rank,cum_value
10,     17396510
32,     31194809
96,     53447300
420,    100379331
1187,   152238166
24234,  432238757
91242,  581332371
294180, 650880870
1242185,665227287"))

## then compute cumulative share
> fmdata[,"cumshare"] <- fmdata[,"cum_value"] / fmdata[nrow(fmdata),"cum_value"]

## then check the data, just in case
> summary(fmdata)
      rank           cum_value            cumshare      
 Min.   :     10   Min.   : 17396510   Min.   :0.02615  
 1st Qu.:     96   1st Qu.: 53447300   1st Qu.:0.08034  
 Median :   1187   Median :152238166   Median :0.22885  
 Mean   : 183732   Mean   :298259489   Mean   :0.44836  
 3rd Qu.:  91242   3rd Qu.:581332371   3rd Qu.:0.87389  
 Max.   :1242185   Max.   :665227287   Max.   :1.00000  

## finally estimate the model, using only the first seven rows of data
## using the parametric form from the paper and some wild guesses as
## starting values:
> fit <- nls(cumshare ~ Beta / ((N50 / rank)^Alpha + 1),
+            data=fmdata[1:7,], start=list(Alpha=1, Beta=1, N50=1e4))
> summary(fit)
Formula: cumshare ~ Beta/((N50/rank)^Alpha + 1)

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
Alpha 4.829e-01  5.374e-03   89.86 9.20e-08 ***
Beta  1.429e+00  2.745e-02   52.07 8.14e-07 ***
N50   3.560e+04  3.045e+03   11.69 0.000306 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002193 on 4 degrees of freedom

Number of iterations to convergence: 8 
Achieved convergence tolerance: 1.297e-06 
> 
which is reasonably close to the quoted 
	N50 = 30714, α = 0.49, and β = 1.38.
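
For comparison, the same fix can be sketched with the F() helper from the original post. The key change is a two-sided formula (response on the left, model on the right), with the response expressed as a fraction rather than a percentage so that it matches the scale of F(). The data frame name lt and the column cumshare follow the posts above; using the paper's values as starting points and restricting to rows 1:7 are assumptions made here, not something the thread prescribes.

```r
# Data from Table 1 of the paper, as posted in the thread
lt <- data.frame(
  rank      = c(10, 32, 96, 420, 1187, 24234, 91242, 294180, 1242185),
  cum_value = c(17396510, 31194809, 53447300, 100379331, 152238166,
                432238757, 581332371, 650880870, 665227287)
)
lt$cumshare <- lt$cum_value / lt$cum_value[nrow(lt)]  # fraction, not percent

# The long-tail function from the original post
F <- function(x, N50, beta = 1.0, alfa = 0.49) {
  beta / ((N50 / as.numeric(x))^alfa + 1)
}

# Two-sided formula: observed share on the left, F() on the right.
# Starting values are the paper's estimates (an assumption of this sketch);
# only the first seven rows are used, as in the fit above.
fit_F <- nls(cumshare ~ F(rank, N50, beta, alfa),
             data  = lt[1:7, ],
             start = list(N50 = 30714, beta = 1.38, alfa = 0.49))
coef(fit_F)
```

Since F() here is the same parametric form as the inline formula, the estimates should agree with the summary(fit) output above up to numerical tolerance.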

You can probably play a little with the nls options to see what effect this
has. 
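
A minimal sketch of what "playing with the options" might look like: trace = TRUE to watch the iterations, nls.control() to raise the iteration limit, and the "port" algorithm with box constraints. The bound values below are arbitrary illustrations rather than recommendations, and the constrained fit is wrapped in tryCatch() because it is not guaranteed to converge.

```r
fmdata <- data.frame(
  rank      = c(10, 32, 96, 420, 1187, 24234, 91242, 294180, 1242185),
  cum_value = c(17396510, 31194809, 53447300, 100379331, 152238166,
                432238757, 581332371, 650880870, 665227287)
)
fmdata$cumshare <- fmdata$cum_value / fmdata$cum_value[nrow(fmdata)]

# Default Gauss-Newton fit; trace = TRUE prints the residual sum of
# squares and the current parameter values at each iteration
fit_gn <- nls(cumshare ~ Beta / ((N50 / rank)^Alpha + 1),
              data    = fmdata[1:7, ],
              start   = list(Alpha = 1, Beta = 1, N50 = 1e4),
              trace   = TRUE,
              control = nls.control(maxiter = 200))

# "port" algorithm with box constraints, given in the same order as
# 'start' (Alpha, Beta, N50). The bounds are illustrative assumptions.
fit_port <- tryCatch(
  nls(cumshare ~ Beta / ((N50 / rank)^Alpha + 1),
      data      = fmdata[1:7, ],
      start     = list(Alpha = 1, Beta = 1, N50 = 1e4),
      algorithm = "port",
      lower     = c(0, 0, 1),
      upper     = c(5, 10, 1e7)),
  error = function(e) NULL   # NULL signals that the constrained fit failed
)

coef(fit_gn)
if (!is.null(fit_port)) coef(fit_port)
```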

That said, seven observations for three parameters in a non-linear model may be
a little hazardous.  One indication is that the estimated parameter values
are not too stable once you add the eighth and ninth rows of data.
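
One way to look at that stability is to refit on the first 7, 8, and 9 rows and put the estimates side by side. This is a sketch only: the helper name fit_rows is made up for illustration, and a fit that fails to converge is recorded as NA rather than stopping the comparison.

```r
fmdata <- data.frame(
  rank      = c(10, 32, 96, 420, 1187, 24234, 91242, 294180, 1242185),
  cum_value = c(17396510, 31194809, 53447300, 100379331, 152238166,
                432238757, 581332371, 650880870, 665227287)
)
fmdata$cumshare <- fmdata$cum_value / fmdata$cum_value[nrow(fmdata)]

# Hypothetical helper: fit the model on the first n rows and return the
# coefficients, or NAs if nls() stops with an error
fit_rows <- function(n) {
  tryCatch(
    coef(nls(cumshare ~ Beta / ((N50 / rank)^Alpha + 1),
             data  = fmdata[seq_len(n), ],
             start = list(Alpha = 1, Beta = 1, N50 = 1e4))),
    error = function(e) c(Alpha = NA_real_, Beta = NA_real_, N50 = NA_real_)
  )
}

# One row of estimates per sample size
estimates <- t(sapply(7:9, fit_rows))
rownames(estimates) <- paste(7:9, "rows")
estimates
```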

Dirk

-- 
Hell, there are no rules here - we're trying to accomplish something. 
                                                  -- Thomas A. Edison
