Split a string where a capital letter follows a lowercase letter in the middle of a word in R
I have several strings that were concatenated by mistake and that I would like to split again. I am dealing with things such as

    name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

which in this case should be split into "o-n-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol". The rule I would like to use is to split a word wherever a digit, an opening bracket ("(") or a capital letter follows a lowercase letter. Any thoughts on how I could use strsplit and/or gsub with a regular expression to achieve this?
You could use positive look-around assertions to find (and then split at) the inter-character positions that are preceded by a lowercase letter and followed by an uppercase letter, a digit, or a "(".

    name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
    pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
    strsplit(name, pat, perl=TRUE)
    # [[1]]
    # [1] "o-n-Butylhydroxylamine"      "1-Methylpropylhydroxylamine"
    # [3] "Amino-2-butanol"
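As a minimal sketch of how the look-around split above could be reused: strsplit is vectorised and returns one list element per input string, so the pattern can be wrapped in a small helper and applied to a whole character vector. The helper name split_concat and the second test string are made up for illustration; they are not part of the question.

```r
# Hypothetical helper (the name is an assumption, not from the question).
split_concat <- function(x) {
  # Zero-width position: lower-case letter behind,
  # upper-case letter, digit or "(" ahead.
  pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
  strsplit(x, pat, perl = TRUE)
}

x <- c(
  "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol",
  "abc(def)Ghi"  # made-up string to exercise the "(" rule
)
res <- split_concat(x)

res[[1]]  # the three chemical names from the question
res[[2]]  # "abc" and "(def)Ghi"
```

Note that ")G" in the second string is not a split point, because ")" is not a lowercase letter. By the same rule, any name that legitimately contains a lowercase letter directly followed by a capital, a digit or "(" would also be split, so it is worth checking the results against a few known names.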
    strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
    # [[1]]
    # [1] "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine" "Amino-2-butanol"

Remember that the return value is a list, so use [[1]] to extract the character vector if appropriate.
Try this:

    name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
    print(strsplit(gsub("([a-z])(\\d)","\\1#\\2", gsub("([a-z])([A-Z])","\\1#\\2",name)),"#")[[1]])

It assumes that a split belongs wherever a lowercase letter is followed by a digit, as well as wherever a lowercase letter is followed by a capital letter.
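The gsub approach above handles capitals and digits but not the "(" case from the question. One possible variant, sketched here under the assumption that "#" never occurs in the input, folds all three cases into a single character class and then splits on the marker as a fixed string:

```r
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

# Assumption: "#" never appears in the data, so it is safe as a temporary marker.
# Insert the marker between a lower-case letter and a following
# upper-case letter, digit or "(".
marked <- gsub("([a-z])([A-Z0-9(])", "\\1#\\2", name)

# Split on the literal marker and take the single list element.
parts <- strsplit(marked, "#", fixed = TRUE)[[1]]
parts
```

If "#" can occur in the real data, a different marker would be needed, or the look-around version avoids the marker entirely.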