thr3ads.net - R help - [R] Retrieving original data frame after repetition [Jul 2009]


Jose Iparraguirre D'Elia

2009-Jul-30 16:15 UTC

[R] Retrieving original data frame after repetition

Dear R users,

 

Consider the first two columns of a data frame like this:

 
> z[,1:2]
  x y
1 1 1
2 2 2
3 3 3
4 1 4

 

Imagine that y represents the number of times the value x occurs in a population.
But z is not exactly a frequency table, because in z we have x=1 twice. So, the
x=1 in the first line and the x=1 in the fourth are not the same: they differ
according to a third variable in the data frame.

 

Now, I use the function rep() in order to obtain a vector of values of x in the
population:

 
> x.pop <- rep(x,y)
> x.pop
 [1] 1 2 2 3 3 3 1 1 1 1

 

How can I go from x.pop back to z? If I use table(x.pop), I obtain a frequency
table like the one below, but not z.

 
> table(x.pop)
x.pop
1 2 3 
5 2 3 

 

(I know I haven't deleted z, obviously, but I need to write a piece of code
to do something very similar).

 

Just in case anyone is wondering by now whether this is an assignment for
college, etc., it is not. The real world problem I'm working on at the
moment has to do with income distribution in Northern Ireland. I want to see how
many people would leave poverty if the income of those currently below 60% of
median income increased by, say, £20 a week. I am working with the Family
Resources Survey sample for Northern Ireland (n=2,263), which I have to gross up
before increasing the incomes (grossed up n=1,712,886). Once I have increased the
income figures for those individuals in poverty, I need to 'un-gross'
the data to get back to n=2,263, and table() simply does not do the trick,
because of exactly the same situation as in the example above.

 

So, please, how can I retrieve z? 

 

Many thanks,

 

Jose

 

Mr José Luis Iparraguirre

Senior Research Economist

Economic Research Institute of Northern Ireland

2-14 East Bridge Street

Belfast BT1 3NQ

Northern Ireland

United Kingdom

 

Tel: +44 (0)28 9072 7365

 



Marc Schwartz

2009-Jul-30 19:13 UTC


[R] Retrieving original data frame after repetition

On Jul 30, 2009, at 11:15 AM, Jose Iparraguirre D'Elia wrote:
> Dear R users,
>
> Consider the first two columns of a data frame like this:
>
> z[,1:2]
>
> x y
>
> 1 1 1
>
> 2 2 2
>
> 3 3 3
>
> 4 1 4
>
>
>
> Imagine that y represents the times that the value x happens in a  
> population. But z is not exactly a frequency table, because in z we  
> have x=1 twice. So, the x=1 in the first line and the x=1 in the  
> fourth are not the same, differing according to a third variable in  
> the data frame.
>
> Now, I use the function rep() in order to obtain a vector of values  
> of x in the population:
>
> x.pop <- rep(x,y)
>
>> x.pop
>
> [1] 1 2 2 3 3 3 1 1 1 1
>
> How can I go from x.pop back to z? If I use table(x.pop), I obtain a  
> frequency table like the one below, but not z.
>
> table(x.pop)
>
> x.pop
>
> 1 2 3
>
> 5 2 3
>
>
> (I know I haven't deleted z, obviously, but I need to write a piece  
> of code to do something very similar).
>
> Just in case anyone is wondering by now whether this is an  
> assignment for college, etc.,-it is not. The real world problem I'm  
> working on at the moment has to do with income distribution in  
> Northern Ireland. I want to see how many people would leave poverty  
> if the income of those currently below 60% median income increases  
> by, say, £20 a week. I am working with the Family Resources Survey  
> sample for Northern Ireland (n=2,263), which I have to gross up  
> before increasing the incomes (grossed up n=1,712,886). Once I  
> increased the income figures for those individuals in poverty, I  
> need to 'un-gross' the data to get back to  n=2,263 -and table()  
> simply does not do the trick, because of exactly the same situation  
> in the example above.
>
> So, please, how can I retrieve z?
>
> Many thanks,
>
> Jose
Presuming that your larger case is similar in structure to 'x.pop',  
which is to say that each unique value is in sequential runs, you can  
use:

z <- do.call(data.frame, rle(x.pop))[, c(2, 1)]

colnames(z) <- c("x", "y")

 > z
x y
1 1 1
2 2 2
3 3 3
4 1 4


See ?rle for more information on summarizing runs of values. The core  
of the first step above yields:

 > rle(x.pop)
Run Length Encoding
lengths: int [1:4] 1 2 3 4
values : num [1:4] 1 2 3 1

which is a list of two elements, that we coerce to a data frame using  
do.call(), reversing the two columns to match your original order.
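
As a quick check of the approach, and assuming only base R, the sketch below rebuilds x.pop from the toy z, recovers z with rle(), and confirms with inverse.rle() that the encoding is lossless for the vector itself. The last lines illustrate the presumption stated above: if two neighbouring rows share the same x, rle() cannot tell them apart and merges them. The data frame `w` and the names `r` and `z2` are made up for illustration.

```r
# Toy data from the original post
z <- data.frame(x = c(1, 2, 3, 1), y = c(1, 2, 3, 4))
x.pop <- rep(z$x, z$y)

# Run-length encode, then build the data frame as (values, lengths)
r  <- rle(x.pop)
z2 <- data.frame(x = r$values, y = r$lengths)

# inverse.rle() reverses rle(), so the vector round-trips exactly
round.trip.ok <- identical(inverse.rle(r), x.pop)

# Caveat: adjacent rows with equal x collapse into a single run
w <- data.frame(x = c(1, 2, 2, 3), y = c(1, 2, 5, 3))  # hypothetical rows
merged.lengths <- rle(rep(w$x, w$y))$lengths           # three runs, not four
```

Here the two x=2 rows of `w` come back as one run of length 7, so the original row boundary is lost unless some other variable is carried along.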

HTH,

Marc Schwartz

Jose Iparraguirre D'Elia

2009-Jul-31 10:52 UTC


[R] Retrieving original data frame after repetition

Hi Marc (et al)

I've spoken too soon...

Please, have a look at this chunk of real world data.

The data frame a below contains the first ten records (and first two columns) of
a survey dataset. It reads as follows: 1662 people have an income of 279, etc.
If you look at lines 2 and 3, there are 1956 people earning 218 but there are also
489 people earning the same amount. The difference between these two groups of
people lies in a third column, not shown. (We could think of men and women,
respectively, for example.)

a
   income grossing
1     279     1662
2     218     1956
3     218      489
4     378      278
5     420      278
6     200      289
7     149      191
8     256     1360
9     269     1348
10   1259      900


Now I create a vector of all people, one by one, with their respective incomes,
by repeating income times grossing:

aa <- rep(a$income, a$grossing)
length(aa)
[1] 8751

If I apply Marc's suggestion, 

z <- do.call(data.frame, rle(aa))[, c(2, 1)]
colnames(z) <- c("x", "y")

I obtain 

z
     x    y
1  279 1662
2  218 2445
3  378  278
4  420  278
5  200  289
6  149  191
7  256 1360
8  269 1348
9 1259  900

That is, lines 2 and 3 in the original data frame have been merged.

How can I retrieve the original data frame a?

Do I need to use that 'missing' third column? And if so, how? I've
read ?rle but it seems it only applies to vectors.

Any help, once again, greatly appreciated...

Regards,

Jose
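
One possible way to keep rows 2 and 3 apart, sketched here under the assumption that a row identifier can travel with each person, is to repeat the row index alongside the income. rle() applied to that identifier then recovers one row per original record, even after incomes have been changed. The `aa.id` vector, the 60%-of-median poverty line computed on the ten records shown, and the flat increase of 20 are illustrative assumptions, not part of the original posts.

```r
# First ten records from the post above
a <- data.frame(
  income   = c(279, 218, 218, 378, 420, 200, 149, 256, 269, 1259),
  grossing = c(1662, 1956, 489, 278, 278, 289, 191, 1360, 1348, 900)
)

# Gross up: repeat the row index (hypothetical 'aa.id') together with income
aa.id <- rep(seq_len(nrow(a)), a$grossing)
aa    <- a$income[aa.id]

# Example adjustment: add 20 to everyone below 60% of the grossed-up median
poverty.line <- 0.6 * median(aa)
poor <- aa < poverty.line
aa[poor] <- aa[poor] + 20

# Un-gross: one row per original record, keyed on the id rather than on income
r  <- rle(aa.id)
a2 <- data.frame(
  income   = aa[cumsum(r$lengths)],  # income of the last person in each run
  grossing = r$lengths
)
```

With these ten records, `a2` has ten rows again and rows 2 and 3 stay separate. The sketch assumes every person in a record receives the same adjustment, so that a single income per id is well defined, and that all grossing weights are positive whole numbers.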




