Jose Iparraguirre D'Elia
2009-Jul-30 16:15 UTC
[R] Retrieving original data frame after repetition
Dear R users,

Consider the first two columns of a data frame like this:

    > z[,1:2]
      x y
    1 1 1
    2 2 2
    3 3 3
    4 1 4

Imagine that y represents the times that the value x happens in a population. But z is not exactly a frequency table, because in z we have x=1 twice. So, the x=1 in the first line and the x=1 in the fourth are not the same, differing according to a third variable in the data frame.

Now, I use the function rep() in order to obtain a vector of values of x in the population:

    > x.pop <- rep(x,y)
    > x.pop
     [1] 1 2 2 3 3 3 1 1 1 1

How can I go from x.pop back to z? If I use table(x.pop), I obtain a frequency table like the one below, but not z.

    > table(x.pop)
    x.pop
    1 2 3
    5 2 3

(I know I haven't deleted z, obviously, but I need to write a piece of code to do something very similar).

Just in case anyone is wondering by now whether this is an assignment for college, etc., it is not. The real world problem I'm working on at the moment has to do with income distribution in Northern Ireland. I want to see how many people would leave poverty if the income of those currently below 60% median income increases by, say, £20 a week. I am working with the Family Resources Survey sample for Northern Ireland (n=2,263), which I have to gross up before increasing the incomes (grossed up n=1,712,886). Once I increased the income figures for those individuals in poverty, I need to 'un-gross' the data to get back to n=2,263, and table() simply does not do the trick, because of exactly the same situation in the example above.

So, please, how can I retrieve z?

Many thanks,

Jose

Mr José Luis Iparraguirre
Senior Research Economist
Economic Research Institute of Northern Ireland
2 -14 East Bridge Street
Belfast BT1 3NQ
Northern Ireland
United Kingdom
Tel: +44 (0)28 9072 7365
On Jul 30, 2009, at 11:15 AM, Jose Iparraguirre D'Elia wrote:

> How can I go from x.pop back to z? If I use table(x.pop), I obtain a
> frequency table like the one below, but not z.
> [...]
> So, please, how can I retrieve z?

Presuming that your larger case is similar in structure to 'x.pop', which is to say that each unique value is in sequential runs, you can use:

    z <- do.call(data.frame, rle(x.pop))[, c(2, 1)]
    colnames(z) <- c("x", "y")

    > z
      x y
    1 1 1
    2 2 2
    3 3 3
    4 1 4

See ?rle for more information on summarizing runs of values.

The core of the first step above yields:

    > rle(x.pop)
    Run Length Encoding
      lengths: int [1:4] 1 2 3 4
      values : num [1:4] 1 2 3 1

which is a list of two elements, that we coerce to a data frame using do.call(), reversing the two columns to match your original order.

HTH,

Marc Schwartz
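The condition behind the rle() approach can be checked directly. The following is only a sketch using the toy data from the question; the names z2, round.trip.ok, x.adj and r.adj are illustrative and do not come from the thread. It suggests that the round trip holds as long as no two adjacent rows of the original data frame share the same x, and that adjacent rows with equal x collapse into a single run.

```r
# Toy data from the question
z <- data.frame(x = c(1, 2, 3, 1), y = c(1, 2, 3, 4))
x.pop <- rep(z$x, z$y)

# rle() returns a list with components 'lengths' and 'values'
r <- rle(x.pop)
z2 <- data.frame(x = r$values, y = r$lengths)

# inverse.rle() rebuilds the expanded vector from the run lengths
round.trip.ok <- identical(inverse.rle(r), x.pop)

# Hypothetical failure case: two adjacent rows with the same x
# (x = 1 repeated 2 times, then x = 1 repeated 3 times)
x.adj <- rep(c(1, 1), c(2, 3))
r.adj <- rle(x.adj)   # a single run, so the two rows cannot be told apart
```

In the toy data the two x=1 rows are separated by other values, which is why they survive as distinct runs.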
Jose Iparraguirre D'Elia
2009-Jul-31 10:52 UTC
[R] Retrieving original data frame after repetition
Hi Marc (et al)

I've spoken too soon...

Please, have a look at this chunk of real world data. The data frame a below contains the first ten records (and first two columns) of a survey dataset. It reads as follows: 1662 people have an income of 279, etc. If you see lines 2 and 3, there are 1956 people earning 218 but there are also 489 people earning the same amount. The difference between these two groups of people lies in a third column, not shown. (We could think of men and women, respectively, for example).

    > a
       income grossing
    1     279     1662
    2     218     1956
    3     218      489
    4     378      278
    5     420      278
    6     200      289
    7     149      191
    8     256     1360
    9     269     1348
    10   1259      900

Now I create a vector of all people, one by one, with their respective incomes, by repeating income times grossing:

    > aa <- rep(a$income, a$grossing)
    > length(aa)
    [1] 8751

If I apply Marc's suggestion,

    > z <- do.call(data.frame, rle(aa))[, c(2, 1)]
    > colnames(z) <- c("x", "y")

I obtain

    > z
         x    y
    1  279 1662
    2  218 2445
    3  378  278
    4  420  278
    5  200  289
    6  149  191
    7  256 1360
    8  269 1348
    9 1259  900

That is, lines 2 and 3 in the original data frame have been merged.

How can I retrieve the original data frame a? Do I need to use that 'missing' third column? And if so, how? I've read ?rle but it seems it only applies to vectors.

Any help, once again, greatly appreciated...

Regards,

Jose

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at me.com]
Sent: 30 July 2009 20:13
To: Jose Iparraguirre D'Elia
Cc: r-help at r-project.org
Subject: Re: [R] Retrieving original data frame after repetition

[...]
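One possible way to keep rows with equal incomes apart, offered here only as a sketch and not as an answer given in the thread, is to expand a row index alongside the incomes and collapse by that index instead of by income value. The names row.id and back are hypothetical; the data are the ten records shown in the follow-up message, and only base R is used.

```r
# The ten records from the follow-up message
a <- data.frame(
  income   = c(279, 218, 218, 378, 420, 200, 149, 256, 269, 1259),
  grossing = c(1662, 1956, 489, 278, 278, 289, 191, 1360, 1348, 900)
)

# Expand the row numbers the same way the incomes are expanded
row.id <- rep(seq_len(nrow(a)), a$grossing)
aa     <- a$income[row.id]          # same values as rep(a$income, a$grossing)

# (any per-person change to aa would go here; row.id stays untouched)

# Collapse by row id: take the first income of each group and count its members
back <- data.frame(
  income   = aa[!duplicated(row.id)],
  grossing = tabulate(row.id, nbins = nrow(a))
)
```

Because the grouping key is the original row number, rows 2 and 3 stay separate even though both have an income of 218. Taking the first income per group assumes that any adjustment is applied identically to everyone within the same original row.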