thr3ads.net - R help - [R] dealing with a messy dataset [Oct 2017]


jean-philippe

2017-Oct-05 14:12 UTC

[R] dealing with a messy dataset

Dear R-users,


I am facing a common and basic problem in dealing with datasets, but I
have not found a satisfying answer so far.
I have a messy dataset of galaxies that looks like this:

And XVIII          000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031               0.6  1.54
PAndAS-03          000356.4+405319  0.10 17     0.00  -3.6 27.8 4.38                    2.8 MESSIER031               2.8  1.75
PAndAS-04          000442.9+472142  0.05 22     0.00  -6.6 23.1 5.59              -108  2.5 MESSIER031               2.5  1.75
PAndAS-05          000524.1+435535  0.06 31     0.00  -4.5 25.6 4.75               103  2.8 MESSIER031               2.8  1.75
ESO409-015         000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   8.10  769 -2.0 NGC0024                 -1.5 -2.05
AGC748778          000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   6.64  486 -1.9 NGC0253                 -1.5 -2.72
And XX             000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70        -182  2.4 MESSIER031               2.4  1.75

I would like to read this dataset so that the space between "And" and
"XVIII" is not treated as a column separator: the full galaxy name
should end up in a single column.
How can I do this?

For instance, I tried

    data1 <- read.table("lvg_table2.txt", skip = 70, fill = TRUE)

where I used fill = TRUE because the rows do not end up with the same
number of fields once R splits the galaxy names into two columns at
the space.
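
A quick way to see the symptom, as a minimal sketch: count.fields() from the utils package reports how many whitespace-separated fields each line contains. The two lines below are shortened, illustrative versions of the rows above, not the real file.

```r
# Shortened, illustrative versions of two rows; not the real file.
lines <- c(
  "And XVIII          000214.5+450520  0.69",
  "PAndAS-03          000356.4+405319  0.10"
)

# count.fields() splits on whitespace by default, as read.table() does.
con <- textConnection(lines)
n_fields <- count.fields(con)
close(con)

# The name that contains a space yields one extra field on its line.
print(n_fields)
```

Lines with differing field counts are a hint that whitespace is not a reliable separator for this file.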


Best Regards, thanks in advance


Jean-Philippe Fontaine

-- 
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774

Boris Steipe

2017-Oct-05 14:19 UTC

Is this a fixed-width format?
If so, read.fwf() in base R (the utils package) or read_fwf() in the readr
package will solve the problem. You may need to trim trailing spaces though.


B.
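
A minimal sketch of the read.fwf() route, under stated assumptions: the two lines are synthetic and built with sprintf() so their widths are known exactly, and the column names Name and logMHI are chosen for illustration only. It is not the layout of the real file.

```r
# Two synthetic fixed-width lines: name (18 chars), one separator
# character, then a value (5 chars).
lines <- c(
  sprintf("%-18s %5.2f", "And XVIII", 6.65),
  sprintf("%-18s %5.2f", "ESO409-015", 8.10)
)

# read.fwf() accepts a file name or a connection; textConnection() lets
# the sketch run without a file on disk. A negative width skips that
# many characters (here the single separator).
con <- textConnection(lines)
galaxies <- read.fwf(
  con,
  widths = c(18, -1, 5),
  col.names = c("Name", "logMHI"),
  strip.white = TRUE,        # trim the padding spaces around each field
  stringsAsFactors = FALSE
)
close(con)

print(galaxies)
```

The space inside "And XVIII" survives because the columns are cut by position, not by whitespace; strip.white = TRUE takes care of the trailing padding.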




jean-philippe

2017-Oct-05 14:33 UTC

Dear Boris,

Thanks for your answer!

Yes, it seems to be a fixed-width format. I had forgotten about this type
of dataset, since I am not used to analyzing and processing them.

Thanks, it seems to fix the problem (I just need to think a bit more
about the width of each field)!


Cheers

Jean-Philippe


jim holtman

2017-Oct-05 14:49 UTC

It looks like fixed width.  I used the last position of each field to
get the widths and read it with the 'readr' package:

    > library(readr)
    > input <- "And XVIII          000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031               0.6  1.54
    + PAndAS-03          000356.4+405319  0.10 17     0.00  -3.6 27.8 4.38                    2.8 MESSIER031               2.8  1.75
    + PAndAS-04          000442.9+472142  0.05 22     0.00  -6.6 23.1 5.59              -108  2.5 MESSIER031               2.5  1.75
    + PAndAS-05          000524.1+435535  0.06 31     0.00  -4.5 25.6 4.75               103  2.8 MESSIER031               2.8  1.75
    + ESO409-015         000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   8.10  769 -2.0 NGC0024                 -1.5 -2.05
    + AGC748778          000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   6.64  486 -1.9 NGC0253                 -1.5 -2.72
    + And XX             000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70        -182  2.4 MESSIER031               2.4  1.75"
    >
    > start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
    +            92, 114, 121, 127)
    > read_fwf(input, fwf_widths(diff(start)))
    # A tibble: 7 x 17
              X1              X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13   X14
           <chr>           <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <int> <dbl>
    1  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8  6.44  6.78     <  6.65   -44   0.5
    2  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8  4.38    NA  <NA>    NA    NA   2.8
    3  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1  5.59    NA  <NA>    NA  -108   2.5
    4  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6  4.75    NA  <NA>    NA   103   2.8
    5 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1  8.10  8.25  <NA>  8.10   769  -2.0
    6  AGC748778 000634.4+153039  0.61    70     3     0 -10.4  24.9  6.39  5.70  <NA>  6.64   486  -1.9
    7     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1  5.26  5.70  <NA>    NA  -182   2.4
    # ... with 3 more variables: X15 <chr>, X16 <dbl>, X17 <dbl>
    >


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



jean-philippe

2017-Oct-05 15:02 UTC

Dear Jim,

Thanks for your reply and your suggestion.

I forgot to provide the header of the data file; here it is:
================================================================================
Byte-by-byte Description of file: lvg_table2.dat
--------------------------------------------------------------------------------
    Bytes Format Units       Label   Explanations
--------------------------------------------------------------------------------
    1- 18 A18    ---         Name    Galaxy name in well-known catalogs
   20- 21 I2     h           RAh     Hour of Right Ascension (J2000)
   22- 23 I2     min         RAm     Minute of Right Ascension (J2000)
   24- 27 F4.1   s           RAs     Second of Right Ascension (J2000)
       28 A1     ---         DE-     Sign of the Declination (J2000)
   29- 30 I2     deg         DEd     Degree of Declination (J2000)
   31- 32 I2     arcmin      DEm     Arcminute of Declination (J2000)
   33- 34 I2     arcsec      DEs     Arcsecond of Declination (J2000)
   36- 40 F5.2   kpc         a26     ? Major linear diameter (1)
   42- 43 I2     deg         inc     ? Inclination
   45- 47 I3     km/s        Vm      ? Amplitude of rotational velocity (2)
   49- 52 F4.2   mag         AB      ? Internal B band extinction (3)
   54- 58 F5.1   mag         BMag    ? Absolute B band magnitude (4)
   60- 63 F4.1   mag/arcsec2 SBB     ? Average B band surface brightness (5)
   65- 69 F5.2   [solLum]    logKLum ? Log K_S_ band luminosity (6)
   71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
       77 A1     ---       l_logMHI  Limit flag on logMHI
   78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
   84- 87 I4     km/s        VLG     ? Radial velocity (9)
   89- 92 F4.1   ---         Theta1  ? Tidal index (10)
   94-116 A23    ---         MD      Main disturber name (11)
  118-121 F4.1   ---         Theta5  ? Another tidal index (12)
  123-127 F5.2   [-]         Thetaj  ? Log K band luminosity density (13)
--------------------------------------------------------------------------------

My goal is to select only the galaxy name and the logMHI value for each
galaxy, which is quite a simple job once the dataset is tidy enough. I
was planning, as usual, to use select() from dplyr.
That is why I was asking how to read this kind of file, which is
uncommon for me so far.
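
As a rough sketch of how the byte-by-byte description could drive the import directly: the start and end bytes of the two wanted fields (Name 1-18, logMHI 78-82) are converted into read.fwf() widths, with negative widths skipping everything in between. The data lines are synthetic stand-ins laid out at those positions, and the "x" filler is purely illustrative.

```r
# Byte positions for the two wanted fields, copied from the
# byte-by-byte description (Name: 1-18, logMHI: 78-82).
spec <- data.frame(
  label = c("Name", "logMHI"),
  start = c(1, 78),
  end   = c(18, 82),
  stringsAsFactors = FALSE
)

# Convert positions to read.fwf() widths: a negative width skips the
# gap before a field, a positive width reads the field itself.
gap    <- spec$start - c(0, head(spec$end, -1)) - 1
widths <- as.vector(rbind(-gap, spec$end - spec$start + 1))
widths <- widths[widths != 0]   # drop zero-length gaps

# Synthetic lines laid out at those positions; the "x" filler stands
# in for the 59 bytes between the two fields.
filler <- strrep("x", 59)
lines <- c(
  paste0(sprintf("%-18s", "And XVIII"), filler, sprintf("%5.2f", 6.65)),
  paste0(sprintf("%-18s", "PAndAS-03"), filler, strrep(" ", 5))
)

con <- textConnection(lines)
lvg <- read.fwf(con, widths = widths, col.names = spec$label,
                strip.white = TRUE, stringsAsFactors = FALSE)
close(con)

print(lvg)
```

Reading only the needed fields makes a later select() step unnecessary; a blank logMHI field comes back as NA. With readr, fwf_positions() takes start and end positions directly, so the conversion to widths can be skipped there.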

With your approach, most of the columns are read correctly, but a few
are not; I will see how to adjust some of the widths to fix them:

           X1              X2    X3    X4    X5    X6    X7    X8 X9    X10   X11   X12   X13   X14          X15   X16     X17
        (chr)           (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)  (chr) (int) (chr) (chr) (chr)        (chr) (dbl)   (chr)
1   UGC12894 000022.5+392944  2.78    33    21     0 -13.3  25.2 7.5 8  8.1     7   7.9 2  61 9 -1.    3 NGC7640    -1 0  0.12
2        WLM 000158.1-152740  3.25    90    22     0 -14.1  24.8 7.7 0  8.2     7   7.8 4  -1 6  0. 0 MESSIER031     0 2  1.75
3  And XVIII 000214.5+450520  0.69    17     9     0  -8.7  26.8 6.4 4  6.7     8 < 6.6 5  -4 4  0. 5 MESSIER031     0 6  1.54
4  PAndAS-03 000356.4+405319  0.10    17    NA     0  -3.6  27.8 4.3      8    NA    NA    NA    2. 8 MESSIER031     2 8  1.75
5  PAndAS-04 000442.9+472142  0.05    22    NA     0  -6.6  23.1 5.5      9    NA    NA   -10 8  2. 5 MESSIER031     2 5  1.75
6  PAndAS-05 000524.1+435535  0.06    31    NA     0  -4.5  25.6 4.7      5    NA    NA    10 3  2. 8 MESSIER031     2 8  1.75
7 ESO409-015 000531.8-280553  3.00    78    23     0 -14.6  24.1 8.1 0  8.2     5   8.1 0  76 9 -2.    0 NGC0024    -1 5 -2.05
8  AGC748778 000634.4+153039  0.61    70     3     0 -10.4  24.9 6.3 9  5.7     0   6.6 4  48 6 -1.    9 NGC0253    -1 5 -2.72
9     And XX 000730.7+350756  0.20    33     5     0  -5.8  27.1 5.2 6  5.7     0    NA   -18 2  2. 4 MESSIER031     2 4  1.75
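
One possible reading of the split values above (for example 7.5 and 8 where a single number is expected): the widths were taken from the e-mailed sample, and the e-mail text may not have kept every space of the real file. A rough check, assuming the byte-by-byte description is the authoritative layout, is to compare the last byte of each column implied by diff(start) with the last byte listed in the description. The doc_end vector is transcribed by hand from the description and treats bytes 20-34 (RAh to DEs) as one coordinate field.

```r
# Column boundaries from the read_fwf() call earlier in this thread.
start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
           92, 114, 121, 127)

# Last byte of each column implied by fwf_widths(diff(start)).
end_used <- cumsum(diff(start))

# Last byte of each field per the byte-by-byte description
# (hand-transcribed; bytes 20-34 are treated as one coordinate field).
doc_end <- c(18, 34, 40, 43, 47, 52, 58, 63, 69, 75, 77, 82, 87, 92,
             116, 121, 127)

# Columns whose boundary falls before the documented end of the field.
short_cols <- which(end_used < doc_end)
print(data.frame(column = seq_along(doc_end), end_used, doc_end))
print(short_cols)
```

Under these assumptions the boundaries from column 9 onward fall short of the documented positions, which is where the output above starts to break. Passing the documented start and end positions to fwf_positions() in readr is one way to avoid the mismatch.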


Cheers, thanks again


Jean-Philippe
