A couple of pointers:
a) the result of read.table is not a matrix but a data frame, so you need
to ensure you have a matrix. as.matrix would help, but I suspect you
would do better to use scan() if the matrix is at all large.
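For example, a minimal sketch of a), assuming "bigram.500" is a plain
whitespace-separated table of counts with no header or row names (as in the
session quoted below):

    ## read.table() returns a data frame; coerce it to a numeric matrix
    x <- as.matrix(read.table("bigram.500"))

    ## for a large file, scan() is quicker; this assumes 500 columns of
    ## bare numbers and nothing else in the file
    x <- matrix(scan("bigram.500"), ncol = 500, byrow = TRUE)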
b) You seem to be describing a skeletal version of principal components
analysis. PCA will give as many PCs as input dimensions unless told
otherwise. Reading up about PCA (and feature extraction methods more
generally) would be helpful, I think. PCA can `predict' new points.
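And a rough sketch of b) with prcomp(), assuming x is the numeric matrix from
above and that 10 components are wanted (how, or whether, to scale the counts
first is a separate question):

    pc <- prcomp(x)            # principal components of the bigram matrix
    scores <- pc$x[, 1:10]     # 10-dimensional coordinates for the 500 words

    ## predict() places new observations in the same space; 'newcounts' is a
    ## hypothetical matrix of counts for other words, with the same columns
    ## (and column names) as x
    newscores <- predict(pc, newdata = newcounts)[, 1:10]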
On Mon, 18 Nov 2002, Corrin Lakeland wrote:
> Hi all, this is probably simple and I'm just doing something stupid,
> sorry about that :-)
>
> I'm trying to convert words (strings of letters) into a fairly
> small-dimensional space (say 10, but anything between about 5 and 50
> would be ok), which I will call a feature vector. The distance between
> two words represents the similarity of the contexts of the words, so
> big and little have very similar contexts and should get a similar
> representation. Basically, the aim is to build something similar to a
> thesaurus.
>
> I have computed bigram counts between the n most common words, for
> varying values of n between 500 and 5000. These are saved to a file
> which I can load with read.table. This matrix is symmetric and far from
> sparse, although I can adjust the sparseness by changing the bigram
> window. First question: should I scale the counts? The angle is all
> that is really important; I'd like 1,1,1,2 to be basically the same as
> 2,2,2,4, perhaps with the latter having more weight in resolving
> discrepancies.
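
One common way to make only the angle matter (a sketch only, and certainly
not the only sensible choice) is to scale each row of counts to unit length,
so that rows differing only by a constant factor, such as 1,1,1,2 and
2,2,2,4, become identical; here x is assumed to hold the counts as a numeric
matrix, one row per word:

    ## divide each row by its Euclidean length, keeping only the 'angle'
    xn <- x / sqrt(rowSums(x^2))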
>
> Next is the job of reducing the matrix from 500 dimensions to, say, 10.
> I think the correct way of doing this is to use SVD; does that sound
> right? At least, I have read a paper by Schuetze which used SVD. Other
> algorithms (K-means, SOM) also sound applicable but may balk at the
> amount of data, or might not provide the distance property I'm trying
> to get.
>
> However, I must be doing something stupid here, because the result I
> get from SVD has n dimensions instead of k. Firstly, I don't seem to be
> able to use La.svd at all, and for normal svd I'm not getting the
> results I expect.
>
> > x <- read.table("bigram.500")
> > xs <- La.svd(x)
> Error in La.svd(x) : argument to La.svd must be numeric or complex
>
> > xs <- svd(x)
> > ncol(xs$v)
> [1] 500
> > nrow(xs$v)
> [1] 500
> > nrow(xs$u)
> [1] 500
> > ncol(xs$u)
> [1] 500
>
> Also, how should I locate the million or so less common words in the
> space generated by this? Running svd on the full bigrams sounds
> infeasible; it would be a 200GB matrix, for a start. Really, I just
> want to 'predict' their location rather than build the classifier with
> a larger set.
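
One possibility for the rarer words (a sketch, not a tested recipe): with the
truncated SVD in the sketch above, a new word's count row can be projected
into the same 10-dimensional space without redoing the decomposition, in the
way latent semantic analysis "folds in" new items. Here 'newrow' is a
hypothetical length-500 vector of bigram counts for one rare word against the
same 500 common words:

    ## project a new row, newrow %*% V %*% diag(1/d), onto the 10 coordinates
    proj <- drop(newrow %*% sv$v %*% diag(1 / sv$d[1:10]))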
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595