How can I find repeated words in a file using grep/egrep?


I need to find repeated words in a file using egrep (or grep -E) in Unix (bash).
I tried:
egrep "(\<[a-zA-Z]+\>) \1" file.txt
and
egrep "(\b[a-zA-Z]+\b) \1" file.txt
but for some reason these treat things as repeats that aren't!
For example, the string "word words" meets the criteria despite the word boundary condition \> or \b.
\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.
If you want the second instance to also be on a word boundary, you need to say so:
egrep "(\b[a-zA-Z]+) \1\b" file.txt
That is no different from:
egrep "\b([a-zA-Z]+) \1\b" file.txt
The space in the pattern forces a word boundary, so I removed the redundant \bs. If you wanted to be more explicit, you could put them in:
egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
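A quick way to convince yourself is to run both patterns over a tiny test file. This is only a sketch: the sample lines and the variable names (sample, loose, strict) are made up for illustration, and it uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical sample file; the three lines are invented for this demo.
sample=$(mktemp)
printf '%s\n' 'word words' 'the the cat' 'this is fine' > "$sample"

# Pattern from the question: \1 only repeats the captured text,
# so the "word" at the start of "words" is accepted.
loose=$(grep -E '(\<[a-zA-Z]+\>) \1' "$sample")

# Anchored pattern: the trailing \> forces the repeat to end at a word boundary.
strict=$(grep -E '\<([a-zA-Z]+) \1\>' "$sample")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

The loose pattern should report both "word words" and "the the cat", while the anchored one should report only "the the cat".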
This is the expected behaviour. See what man grep says:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the
beginning and end of a word. The symbol \b matches the empty string at
the edge of a word, and \B matches the empty string provided it's not
at the edge of a word. The symbol \w is a synonym for [[:alnum:]] and
\W is a synonym for [^[:alnum:]].
and then in another place we see what "word" is:
Matching Control
Word-constituent characters are letters, digits, and the underscore.
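To see what that definition of a word means in practice, here is a small sketch that assumes GNU grep (where \b is available) and two invented input lines. Because the underscore counts as a word character and the hyphen does not, a boundary-anchored pattern should reject "word word_s" but still accept "word word-s".

```shell
# Hypothetical input: "_" is a word constituent, "-" is not.
lines=$(printf '%s\n' 'word word_s' 'word word-s')

# \b after \1 only succeeds where a word character meets a non-word character.
matched=$(printf '%s\n' "$lines" | grep -E '\b([a-zA-Z]+) \1\b')

printf '%s\n' "$matched"
```

So if hyphenated forms should not count as repeats, the pattern would need a stricter condition than \b alone.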
So this is what your patterns produce:
$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
I use
pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *
to check my documents for such errors. This also works if there is a line break between the duplicated words.
Explanation:
-M, --multiline: run in multiline mode (important if there is a line break between the duplicated words).
[a-zA-Z]+: Match words
\b: Word boundary, see tutorial
(\b[a-zA-Z]+): Group it
\s+ match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: Match whatever was in the first group
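As a rough check of the multiline behaviour, something like the following can be used. It is a sketch only: the file contents and the variable names (doc, multi) are invented, and it assumes pcregrep is installed (on some systems the PCRE2 build is called pcre2grep instead), so the script skips the search when the command is missing.

```shell
# Hypothetical two-line document with "the" repeated across the line break.
doc=$(mktemp)
printf '%s\n' 'This sentence ends with the' 'the same word on the next line.' > "$doc"

if command -v pcregrep >/dev/null 2>&1; then
    # -M lets \s+ match the newline between the two copies of the word.
    multi=$(pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' "$doc")
    printf '%s\n' "$multi"
else
    multi=''
    echo 'pcregrep is not installed; skipping the multiline check' >&2
fi
rm -f "$doc"
```

A plain line-by-line grep would miss this case, because neither line contains the repeated word twice on its own.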
egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt
fixes the problem.
Basically, you have to tell \1 that it needs to stay within word boundaries too.
