Split a string in R where a capital letter follows a lowercase letter in the middle of a word


I have several strings that were concatenated together, and I would like to split them apart again.
For example, I have
name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
which should be split into
"o-n-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol"
How could I use strsplit and/or gsub with a regular expression to achieve this?
The rule I would like to apply is to split wherever a lowercase letter is followed by a digit, an opening parenthesis ("(") or a capital letter.
You could use positive lookbehind and lookahead assertions to find (and then split at) the positions between characters that are preceded by a lowercase letter and followed by an uppercase letter, a digit, or a "(". These assertions require perl=TRUE.
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
strsplit(name, pat, perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine"
# [3] "Amino-2-butanol"
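The sample string does not exercise the "(" rule, so here is a small sketch applying the same pattern to a made-up input. The string x below is hypothetical and exists only for illustration.

```r
# Hypothetical input with a "(" directly after a lowercase letter
x <- "butanol(R)-2-ButanolAmino-2-butanol"

# Same pattern as above: lowercase behind, uppercase/digit/"(" ahead
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"

res <- strsplit(x, pat, perl = TRUE)[[1]]
print(res)
# res is c("butanol", "(R)-2-Butanol", "Amino-2-butanol")
```

Note that no split happens before the "2" in "-2-Butanol", because the character before it is a hyphen and the lookbehind only accepts a lowercase letter.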
Alternatively, with explicit character ranges and alternation:
strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine" "Amino-2-butanol"
Remember that strsplit returns a list with one element per input string, so use [[1]] to extract the character vector if appropriate.
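To illustrate the list return value, here is a minimal sketch that splits more than one string at once. The second input string and the variable names names_in and parts are assumptions made for this example.

```r
# Two inputs: the string from the question plus a hypothetical second one
names_in <- c(
  "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol",
  "methanolEthanol"
)
pat <- "(?<=[a-z])(?=[A-Z0-9(])"

parts <- strsplit(names_in, pat, perl = TRUE)

length(parts)   # one list element per input string
parts[[2]]      # character vector of pieces for the second input
unlist(parts)   # all pieces flattened into a single character vector
```

Use [[i]] when you need the pieces of one particular input, and unlist() when it does not matter which input each piece came from.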
Try this:
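name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
# Insert a "#" marker at each split point, then split on that marker
marked <- gsub("([a-z])([A-Z])", "\\1#\\2", name)
marked <- gsub("([a-z])(\\d)", "\\1#\\2", marked)
print(strsplit(marked, "#")[[1]])
This treats a lowercase letter followed by a digit, or by a capital letter, as a split point. It does not cover the "(" case, and it assumes the input contains no "#" character.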
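The marker approach can probably be shortened to a single gsub call by putting all the "follower" characters in one character class, which also covers the "(" rule from the question. This is a sketch, not a tested drop-in: it assumes "#" never occurs in the input, and the variable names marked and parts are chosen here for illustration.

```r
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

# Assumption: "#" never occurs in the input, so it is safe as a temporary marker.
# One pass: lowercase letter followed by an uppercase letter, a digit, or "("
marked <- gsub("([a-z])([A-Z0-9(])", "\\1#\\2", name)

# fixed = TRUE treats "#" as a literal string rather than a regular expression
parts <- strsplit(marked, "#", fixed = TRUE)[[1]]
print(parts)
```

If "#" can appear in your data, pick a different marker that is guaranteed to be absent, or use the lookaround pattern with perl=TRUE, which needs no marker at all.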
