thr3ads.net - R help - [R] need help with excel data [Jan 2015]

Jeff Newmiller

2015-Jan-22 01:02 UTC

[R] need help with excel data

I think R is quite capable of doing this. You would have to learn a 
comparable number of fiddly bits to accomplish this in R, Python or Perl.

That is not to say that learning Perl or Python is a bad idea... but in 
terms of "shortest path" I think they are of comparable complexity. All 
three languages support regular expressions, which would be the key bit of 
knowledge to acquire regardless of which tool you use.

Other fiddly bits might involve handling the cyrillic strings as data, 
though you did not convey a desire to retain that information.

One way (not extracting cyrillic text):

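library(XLConnect)
DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
# lazy prefix (.*?) so the first group keeps every digit of the first number;
# a greedy .* would leave it only the last digit (lazy matching needs perl=TRUE)
pattern <- "^.*?(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$"
# rows whose second column contains two "a*b" pairs
idx <- grep( pattern, DF[[2]], perl=TRUE )
# reduce each matching cell to "a,b,c,d"
dta <- sub( pattern, "\\1,\\2,\\3,\\4", DF[[2]][idx], perl=TRUE )
dtamatrix <- apply( do.call( rbind
                            , strsplit( dta, "," ) )
                   , 2
                   , as.numeric
                   )
extracted <- data.frame( V1=DF[[1]][idx], dtamatrix )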
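
A side note on patterns of this shape, sketched in Python's re module rather than R. The sample strings (with a Latin "text" prefix standing in for the Cyrillic words) and the helper name extract are made up for illustration. With a greedy .* in front, the first capture group keeps only the last digit of the first number, because the prefix swallows as much as it can; a lazy .*? prefix avoids that. R's regex engines are separate implementations, so the same check is worth repeating there.

```python
import re

# Hypothetical sample cells modeled on the two formats in the question.
samples = ["text 12*23 34*45", "text 12*23, 34*56"]

# Same four-group pattern, once with a greedy and once with a lazy prefix.
greedy = re.compile(r"^.*(\d+) *\* *(\d+)[^\d]*(\d+) *\* *(\d+).*$")
lazy = re.compile(r"^.*?(\d+) *\* *(\d+)[^\d]*(\d+) *\* *(\d+).*$")


def extract(pattern, cell):
    """Return the four captured numbers as ints, or None if no match."""
    m = pattern.match(cell)
    return [int(g) for g in m.groups()] if m else None


for cell in samples:
    print(repr(cell), "greedy:", extract(greedy, cell),
          "lazy:", extract(lazy, cell))
```

Comparing the two columns of output on a handful of real cells is a quick way to see whether the first number is being truncated.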


On Wed, 21 Jan 2015, Collin Lynch wrote:
> Dr. Polanski, I would recommend something else.  Given the messy nature of
> your data I would suggest using a language like Python or Perl to extract
> it to an appropriate format.  Python has good regular expression support
> and unicode support.  If you can save your data as a csv file or even text
> line by line then it would be possible to write some code to read the file,
> match the lines with a simple regular expression, and then spit them back
> out as a csv file which you could read into R.
>
> I realize that this means learning a new language or finding someone with
> the requisite skills but I would recommend that over attempting to use
> R's text processing.
>
>    Collin.
>
> On Wed, Jan 21, 2015 at 3:31 PM, Dr Polanski <n.polyanskij at gmail.com> wrote:
>
>> Hi all!
>>
>> Sorry to bother you. I am trying to learn some R via Coursera courses and
>> other internet sources, yet haven't managed to get far.
>>
>> Now I need to do some, I hope, not too difficult things, which I think
>> R can do, yet I have no idea how to make it do so.
>>
>> I have a big set of (empirical) data which was obtained by my colleagues
>> and stored in an inconvenient way: all of the data is in two cells of an
>> Excel table. An example of the data is in the attached file (the link):
>>
>> https://drive.google.com/file/d/0B64YMbf_hh5BS2tzVE9WVmV3bFU/view?usp=sharing
>>
>> The first column has a number and the second has a whole vector (I
>> guess it is) which looks like
>> "some words in Cyrillic (the length varies)" and then the set of numbers
>> "12*23 34*45" (another problem is that sometimes it is "12*23, 34*56").
>>
>> The number of rows is about 3000, so it is impossible to do manually.
>>
>> What I need at the end is to have it separated into different Excel
>> cells:
>> - what is written in words - |  12  | 23 | 34 | 45 |
>>
>> Do you think it is possible to do so using R (or something else)?
>>
>> Thank you very much in advance, and sorry for asking for help with such a
>> stupid question. The problem is that I am trying and yet haven't even
>> managed to install openSUSE onto my laptop - only Ubuntu! :)
>>
>>
>> Thank you very much!
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

Ista Zahn

2015-Jan-22 02:58 UTC

[R] need help with excel data

I agree, R will be fine for this. Not being as expert with regex as
Jeff I would tend to do this in a few steps, something like

library(XLConnect)
DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
library(stringi)
## insert a marker between the text and the numbers
txt <- stri_replace_all_regex(DF[[2]], "([^\\d]{2,})(\\d+ )", "$1|||$2")
## separate the text from the numbers
stringNums <- stri_split_fixed(txt, "|||", 2, simplify = TRUE)
## split the numbers apart
nums <- stri_split_regex(stringNums[, 2], "[^\\d]+", n = 5, simplify=TRUE)
## put it all back together
extracted <- data.frame(DF[, 1], stringNums[, 1], apply(nums, 2, as.numeric))
## put the names back
names(extracted) <- c(names(DF)[1], paste(names(DF)[2], 1:6, sep = "_"))

Best,
Ista


Collin Lynch

2015-Jan-22 03:24 UTC

[R] need help with excel data

It is good to know R is up to the task, and I have to agree with Ista and
Jeff that if you are more comfortable in R, use it.  By way of comparison,
the Python code would look something like what is below.  You would need to
tweak the regular expression (re.match(...)) to fit your needs, but if you
are just learning Python then sticking with R might be a better choice.

   Best,
   Collin.

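import csv, re

# Input: the sheet saved as CSV; "Col1Name"/"Col2Name" stand for its headers
In = open("Sheet.csv", "r", newline="")
Reader = csv.DictReader(In)

# Open the output for writing ("w"); the default mode is read-only
Out = open("Out.csv", "w", newline="")
Writer = csv.DictWriter(Out, ["Val", "Text", "Numbers"])
Writer.writeheader()

# Tweak this pattern to fit the real data
Pattern = re.compile(
    r"(?P<Text>\S+) (?P<Numbers>[0-9]+ [0-9]+\*[0-9]+,? [0-9]+\*[0-9]+)$")

for D in Reader:
  Match = Pattern.match(D["Col2Name"])
  if Match is None:
    continue  # skip rows that do not fit the pattern
  NewDict = {}
  NewDict["Val"] = D["Col1Name"]
  NewDict["Text"] = Match.group("Text")
  NewDict["Numbers"] = Match.group("Numbers")
  Writer.writerow(NewDict)

In.close()
Out.close()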
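
The script above keeps the numbers together in one column, while the original question asked for each number in its own cell. A minimal sketch of that variant follows. It assumes the two formats quoted in the question ("12*23 34*45" and "12*23, 34*56"); the names ROW, split_cell and convert, the output headers N1..N4, and the sample rows are all made up for illustration, and it reads from in-memory text so it can be tried without the real file.

```python
import csv
import io
import re

# Free text, then two "a*b" pairs separated by a space or by ", ".
ROW = re.compile(r"^(?P<text>.*?)\s*(\d+)\*(\d+),?\s+(\d+)\*(\d+)\s*$")


def split_cell(cell):
    """Return [text, n1, n2, n3, n4], or None if the cell does not match."""
    m = ROW.match(cell)
    if m is None:
        return None
    # groups()[0] is the named text group; the remaining four are the numbers
    return [m.group("text")] + [int(g) for g in m.groups()[1:]]


def convert(src, dst, id_col="Col1Name", data_col="Col2Name"):
    """Copy matching rows from src to dst; return the ids of unmatched rows."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["Val", "Text", "N1", "N2", "N3", "N4"])
    skipped = []
    for row in reader:
        parts = split_cell(row[data_col])
        if parts is None:
            skipped.append(row[id_col])
            continue
        writer.writerow([row[id_col]] + parts)
    return skipped


# Hypothetical sample input standing in for the exported sheet.
sample = io.StringIO(
    "Col1Name,Col2Name\n"
    "1,слова 12*23 34*45\n"
    '2,"другие слова 12*23, 34*56"\n'
    "3,no numbers here\n"
)
out = io.StringIO()
skipped = convert(sample, out)
print(out.getvalue())
print("unmatched ids:", skipped)
```

Returning the unmatched ids instead of silently dropping rows makes it easier to spot cells that need a different pattern among the roughly 3000 rows.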

