regex


Why is C#'s Regex.Matches() returning all matches in a single Match object?


I am having trouble getting all <script> tags and their respective closing </script> tags from an HTML text using regular expressions in C#.
I created a sample HTML document that looks like:
<html>
<head>
<title>
</title>
<script src="adasdsadsda.js"></script>
</head>
<body>
<script type='javascript'>
var a = 1 + 2;
alert('a');
</script>
</body>
<script></script>
</html>
The regular expression I am using is:
<script.*>[^>]*<\/script>
I often use regexr to validate/test my regular expressions (highly recommend it!). It shows the regular expression in question captures 3 occurrences (just as I expect).
But C#'s Regex.Matches is not capturing 3 instances; instead, it returns a single match with all occurrences in it. Is this the expected behavior for the Matches method? I have been using it quite a lot and have always gotten each occurrence as a separate match.
Why is this happening in my case?
P.S.: If, in answering the question, you want to point out that regex is not suited for parsing HTML, please also explain why regexr and .NET's Regex give different results. Do they have different regex implementations?
RegExr uses your browser's RegExp engine for matching. It implements a different regex flavor.
.NET uses its own regex flavor, so I'd suggest using a .NET online tester instead. For example:
Regex Hero
Regex Storm
However, the pattern <script.*>[^>]*<\/script> should return the same matched text in almost all flavors.
Code
string pattern = @"<script.*>[^>]*<\/script>";
var re = new Regex(pattern);
var text = @"
<html>
<head>
<title>
</title>
<script src=""adasdsadsda.js""></script>
</head>
<body>
<script type='javascript'>
var a = 1 + 2;
alert('a');
</script>
</body>
<script></script>
</html>
";
MatchCollection matches = re.Matches(text);
for (int mnum = 0; mnum < matches.Count; mnum++)
{
    // Print each match separately.
    Match match = matches[mnum];
    Console.WriteLine("Match #{0} - Value: {1}", mnum + 1, match.Value);
}
Output
Match #1 - Value: <script src="adasdsadsda.js"></script>
Match #2 - Value: <script type='javascript'>
var a = 1 + 2;
alert('a');
</script>
Match #3 - Value: <script></script>
ideone demo
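The question does not show which options were passed to the Regex constructor, so what follows is only a guess at the cause, not a confirmed diagnosis. Assuming RegexOptions.Singleline was set, the dot also matches line breaks, and the greedy .* would run from the first <script to the last </script>, producing one big match. The class and method names (SinglelineDemo, CountMatches) are made up for this sketch:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the question.
    private const string Pattern = @"<script.*>[^>]*<\/script>";

    // Sample HTML from the question, joined with explicit '\n' line breaks.
    public const string Html =
        "<html>\n<head>\n<title>\n</title>\n" +
        "<script src=\"adasdsadsda.js\"></script>\n" +
        "</head>\n<body>\n" +
        "<script type='javascript'>\nvar a = 1 + 2;\nalert('a');\n</script>\n" +
        "</body>\n<script></script>\n</html>\n";

    // Counts the matches of the pattern under the given options.
    public static int CountMatches(string html, RegexOptions options)
    {
        return Regex.Matches(html, Pattern, options).Count;
    }

    public static void Main()
    {
        // Default options: '.' stops at '\n', so each tag pair matches on its own.
        Console.WriteLine(CountMatches(Html, RegexOptions.None));       // prints 3

        // Assumed cause: with Singleline, '.' also matches '\n', so the
        // greedy .* swallows everything up to the last <script> tag.
        Console.WriteLine(CountMatches(Html, RegexOptions.Singleline)); // prints 1
    }
}
```

If that is what happened, dropping the Singleline option (or making the quantifier lazy, as in .*?) should bring back the three separate matches.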
That said, if you have a > sign in your JavaScript code (as part of an IF condition or in a string), it would fail.
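To make that failure concrete, here is a minimal sketch using the same pattern on two small inputs. The inputs and the names GreaterThanDemo and CountScriptMatches are invented for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreaterThanDemo
{
    // Same pattern as in the question.
    private const string Pattern = @"<script.*>[^>]*<\/script>";

    // Counts how many script blocks the pattern finds in the given markup.
    public static int CountScriptMatches(string html)
    {
        return Regex.Matches(html, Pattern).Count;
    }

    public static void Main()
    {
        // No '>' inside the script body: the block is found.
        string plain = "<script>\nvar a = 1 + 2;\n</script>";

        // A '>' comparison inside a multi-line script body: [^>]* stops at
        // the '>' and never reaches </script>, so nothing matches.
        string withComparison = "<script>\nif (a > 1) { alert('a'); }\n</script>";

        Console.WriteLine(CountScriptMatches(plain));          // prints 1
        Console.WriteLine(CountScriptMatches(withComparison)); // prints 0
    }
}
```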
There are many reasons not to parse HTML with regex, so please take the following advice: don't use regex. Instead, I recommend using an HTML parser, such as the HTML Agility Pack.
I am marking Mariano's answer as the solution, but I am leaving here the outcome of some further research that is not mentioned in the selected answer.
The most popular options seem to be, in order of popularity, the following NuGet packages:
Html Agility Pack
CsQuery
AngleSharp
I ended up using AngleSharp, which has the advantage over CsQuery of still being maintained/developed.
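For anyone landing here, this is roughly what the AngleSharp approach can look like. It is a minimal sketch, not the exact code I ended up with: it assumes the AngleSharp NuGet package in version 0.10 or later (where HtmlParser.ParseDocument is available), and ScriptExtractor and GetScripts are names made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser; // NuGet package "AngleSharp" (assumed 0.10 or later)

public static class ScriptExtractor
{
    // Parses the markup and returns the outer HTML of every <script> element.
    public static List<string> GetScripts(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);
        return document.QuerySelectorAll("script")
                       .Select(script => script.OuterHtml)
                       .ToList();
    }

    public static void Main()
    {
        // Includes a '>' inside a script body, which the regex could not handle.
        string html =
            "<html><head><script src=\"a.js\"></script></head>" +
            "<body><script>\nif (a > 1) { alert('a'); }\n</script></body>" +
            "<script></script></html>";

        foreach (string script in GetScripts(html))
        {
            Console.WriteLine(script);
        }
    }
}
```

Because the parser treats the script body as raw text, a > inside the JavaScript is not a problem here.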
