Hello

I am struggling with data frames and would appreciate some help please.

I have a data set of 13 observations and 80 variables. The first column is the names of different political area boundaries (e.g. MHad, LBNW, etc.), the first row is a vector of variable names concerning various census data (e.g. age.T, hse.Unk, etc.). The first cell [1,1] is blank.

I have loaded this via read.csv('path.to/data.set.csv'), and now want to run some analyses on this data frame. If I ask for a list of the names of the political areas (i.e. the first column), the result is a vector of numbers which appear to correspond to the factor levels, but I don't get the text names, just the corresponding numbers. So, if I want to plot something basic, like the area that uses the most gas for central heating, for example:

  > plot(data.set$ch.Gas)

the y-axis gives the gas usage for the areas, but the x-axis gives only the numbers of the areas, not the names of the areas (which I would prefer).

So, two questions:

(1) Have I set up my csv file correctly to be read as a data frame, i.e. with the variable names in the first row and, below them, the values for each political area in that area's row, in the column with the corresponding variable name? So far, looking through tutorials and books seems to suggest yes, but at this point I'm no longer sure.

(2) How can I access the names of the political areas when plotting so that these are given on the x-axis instead of the numbers?

Thanks for any help.

Cheers
Sun
Hi Sun,

If I understood correctly (a reproducible example would be of great help), it seems you're struggling with factors. Read up on this topic to better understand how they work.

For your plots, you would need to set the labels with the argument 'xlab' for plot(). To access the names of the factors, use levels().

HTH,
Ivan

--
Ivan Calandra, ATER
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
https://www.researchgate.net/profile/Ivan_Calandra

On 11/12/14 14:00, Sun Shine wrote:
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
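To make the pointer to levels() concrete, here is a minimal sketch of how a factor stores its values and why numbers can show up where names were expected. The three-row CSV is made up for illustration: only the area names MHad and LBNW and the column name ch.Gas come from the question, while "ABCD" and all the numbers are invented. stringsAsFactors = TRUE is passed explicitly because, from R 4.0.0 onwards, read.csv no longer converts strings to factors by default.

```r
# Made-up stand-in for the census file; note the blank first header cell.
csv_text <- ",ch.Gas\nMHad,120\nLBNW,340\nABCD,95"

# Force the factor conversion so the behaviour matches the thread.
areas <- read.csv(text = csv_text, stringsAsFactors = TRUE)

names(areas)            # the unnamed first column is called "X"
levels(areas$X)         # the text labels, sorted alphabetically
as.integer(areas$X)     # the integer codes that index into the levels
as.character(areas$X)   # the labels again, in row order
```

The integer codes are positions in levels(areas$X), so levels(areas$X)[as.integer(areas$X)] gives the same result as as.character(areas$X).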
If you are using 'read.csv' (or 'read.table') to read the data, then use the 'as.is = TRUE' argument to prevent the conversion of character data to factors. You can also use "as.character(df$col_with_factors)" to get the character values back.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
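Both suggestions can be sketched side by side. The data below are invented for illustration (only MHad, LBNW and ch.Gas appear in the question), and the column name X is simply what read.csv assigns to a column whose header cell is blank.

```r
# Made-up stand-in for the census file, with a blank first header cell.
csv_text <- ",ch.Gas\nMHad,120\nLBNW,340\nABCD,95"

# Option 1: keep the strings as character vectors while reading.
as_text <- read.csv(text = csv_text, as.is = TRUE)
class(as_text$X)

# Option 2: the column is already a factor; convert it back to character.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
recovered <- as.character(as_factor$X)
identical(recovered, as_text$X)

# With character names, looking up the area with the highest value is direct.
top_area <- as_text$X[which.max(as_text$ch.Gas)]
top_area
```

Either route yields the same character vector; which one is preferable depends on whether the factor is needed later, for example to control the order of categories in a plot.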
Here is a reproducible example:

  > d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1")
  > str(d)
  'data.frame': 3 obs. of 2 variables:
   $ Name: Factor w/ 3 levels "Adam","Bob","Xavier": 2 3 1
   $ Age : int 2 25 1

Do you get something similar? If not, show us what you have (you could trim it down to a few columns).

Let's try some plots.

  > plot(d$Age)

This shows a plot of d$Age (on the y axis) vs "Index", where Index is 1:length(d$Age). The points are at (1,2), (2,25), and (3,1). You gave plot() no information about what should be on the x axis, so it gave you the index numbers.

Now ask for d$Name on the x axis and d$Age on the y.

  > plot(d$Name, d$Age)

This puts the names, in alphabetical order, on the x axis. The y axis ranges from about 0 to 25 and neither axis is labelled. There are thick horizontal line segments where you expect the points to be. These are degenerate boxplots: when you ask to plot a 'factor' variable on the x axis and numbers on the y, you get such a plot.

Some folks suggested you avoid factors by adding stringsAsFactors=FALSE (or as.is=TRUE) to your call to read.csv. Let's try that:

  > d2 <- read.csv(stringsAsFactors=FALSE, text="Name,Age\nBob,2\nXavier,25\nAdam,1")
  > plot(d2$Name, d2$Age)
  Error in plot.window(...) : need finite 'xlim' values
  In addition: Warning messages:
  1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
  2: In min(x) : no non-missing arguments to min; returning Inf
  3: In max(x) : no non-missing arguments to max; returning -Inf

You get no plot at all.

You can get closer to what I think you want with

  with(d, {
      plot(as.integer(Name), Age, axes=FALSE, xlab="Name")
      axis(side=2) # draw the usual y axis
      axis(side=1, at=seq_along(levels(Name)), labels=levels(Name))
  })

If you want the names in a different order on the x axis, then reconstruct the factor object d$Name with a different order of levels. E.g.,

  d$Name <- factor(d$Name, levels=c("Xavier", "Bob", "Adam"))

and replot.
There are various plotting packages, e.g., ggplot2, that can make this sort of thing easier, but I think the recommendation not to use factors is wrong. You do need to learn how to use them to your advantage.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
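As one way of using factors to advantage, the sketch below reuses the toy Name/Age data from the example above but builds the factor explicitly, so it behaves the same whether or not read.csv converted the strings (from R 4.0.0 onwards it does not by default). Listing the levels in order of appearance is a choice made here to keep the x axis in file order rather than alphabetical order; it is not part of the original suggestion.

```r
# Toy data from the example above, read as plain character strings.
d <- read.csv(text = "Name,Age\nBob,2\nXavier,25\nAdam,1",
              stringsAsFactors = FALSE)

# Build the factor by hand; unique() keeps the levels in file order.
d$Name <- factor(d$Name, levels = unique(d$Name))

# Plot the integer codes, suppress the default x axis, then label it by name.
plot(as.integer(d$Name), d$Age, xaxt = "n", xlab = "Name", ylab = "Age")
axis(side = 1, at = seq_along(levels(d$Name)), labels = levels(d$Name))
```

Replacing unique(d$Name) with d$Name[order(d$Age)] would instead order the axis by increasing Age.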
Hello William, Ivan and Jim

I appreciate your replies.

I did suppress the factors using stringsAsFactors=FALSE and in that way was able to progress some more on getting a sense of the data set, so thanks for that suggestion. I had previously overlooked it.

Also thanks William, I never understood what those thick line segments were - now I do. That had been about the best I could get by that point, and still not with the names on the x axis.

Unfortunately, using William's suggestion of 'with' gave me errors:

  > with(MHP.def, {plot(as.integer(MHP.def$Names),cH.E, axes=FALSE, xlab='Area') axis(side=2) axis(side=1, at=seq_along(levels(MHP.def$Names)), lab=levels(MHP.def$Names))})
  Error: unexpected symbol in "with(MHP.def, {plot(as.integer(MHP.def$Names), MHP.def$cH.E, axes=FALSE, xlab='Area') axis"

This may have something to do with the period between cH and E, or perhaps with using $ to access data from a column?

I have now installed ggplot2 and, with the help of the graphics cookbook, will see if I can make some headway like this, at least for now.

I think William's suggestion about learning to work with factors is fundamentally sound and something I will need to get my head around. For now though, I think I'll stick to exploring ggplot2 so that I can visualise this data set more easily.

Thanks again.

Best
Sun
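The "unexpected symbol" error is most likely a parsing issue rather than anything to do with the period in cH.E or with $: inside { }, two calls written on the same line need a newline or a semicolon between them. The sketch below first checks that with parse(), then runs the call with each statement on its own line. The MHP.def data frame here is a hypothetical two-row stand-in with invented values, and parses() is a helper defined only for this example. Because the strings were read with stringsAsFactors=FALSE, the Names column is converted to a factor first; otherwise levels() would return NULL.

```r
# Helper for this example only: does a piece of code parse at all?
parses <- function(code) {
  tryCatch({ parse(text = code); TRUE }, error = function(e) FALSE)
}

parses("{ plot(1:3) axis(side = 2) }")    # two calls run together on one line
parses("{ plot(1:3); axis(side = 2) }")   # same calls separated by a semicolon

# Hypothetical stand-in for MHP.def; the values are invented.
MHP.def <- data.frame(Names = c("MHad", "LBNW"), cH.E = c(12, 30),
                      stringsAsFactors = FALSE)

# levels() needs a factor; keep the areas in their original row order.
MHP.def$Names <- factor(MHP.def$Names, levels = unique(MHP.def$Names))

# Inside with(), columns can be referred to by name, one statement per line.
with(MHP.def, {
  plot(as.integer(Names), cH.E, axes = FALSE, xlab = "Area")
  axis(side = 2)
  axis(side = 1, at = seq_along(levels(Names)), labels = levels(Names))
})
```

The first parses() call returns FALSE and the second TRUE, which matches the error reported above.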