D: split string by comma, but not quoted string


I need to split a string on commas that are not inside quotes, for example:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;

void main()
{
    auto str = `foo,bar,"hello, user",baz`;
    foreach (row; csvReader(str))
    {
        writeln(row);
    }
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data: std.csv would not parse the original correctly because of the space before the opening quote (").
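You can use the following snippet for this task:
import std.regex;
import std.stdio;

void main()
{
    string fileFullName = `D:\code\test\example.csv`;
    auto fileContent = File(fileFullName, "r");
    // Intended to match only commas that sit outside a double-quoted section.
    auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
    foreach (line; fileContent.byLine)
    {
        // Split the current line on every comma the regex matches.
        auto result = split(line, r);
        writeln(result);
    }
}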
If you are parsing a specific file format, splitting by line and using a regex often isn't correct, though it will work in many cases. I prefer to read the input character by character and keep a few flags for state (or, where appropriate, use someone else's function that already does this for the format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old csv.d, which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha, my last change to it was 5 years ago, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas: reading CSV, for example, has a few rules for quoted content. First, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row are an escaped quote that is part of the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with a regex (eyeballing it, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates things compared to just reading the characters, because you might then have to recombine lines later when you find that a quoted field continues past the end of one! And then your beautiful byLine loop suddenly isn't so beautiful.
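To make the character-at-a-time idea concrete, here is a minimal sketch that applies the three quoting rules above with a single state flag. It is not the csv.d code linked earlier; the function name parseCsv and its return type are assumptions made for illustration, and it takes a few shortcuts (for example, an empty line becomes a row holding one empty field).

```d
import std.stdio;

/// Hypothetical helper: parses CSV text one character at a time
/// and returns every row as an array of fields.
string[][] parseCsv(string text)
{
    string[][] rows;
    string[] row;
    string field;
    bool inQuotes = false;     // true while inside a quoted field
    bool rowHasData = false;   // true once the current row has any content

    for (size_t i = 0; i < text.length; i++)
    {
        immutable c = text[i];
        if (inQuotes)
        {
            if (c == '"')
            {
                if (i + 1 < text.length && text[i + 1] == '"')
                {
                    field ~= '"';   // doubled quote: literal quote character
                    i++;            // skip the second quote of the pair
                }
                else
                    inQuotes = false;   // closing quote
            }
            else
                field ~= c;   // commas and newlines are plain content here
        }
        else if (c == '"')
        {
            inQuotes = true;
            rowHasData = true;
        }
        else if (c == ',')
        {
            row ~= field;     // end of field
            field = null;
            rowHasData = true;
        }
        else if (c == '\n')
        {
            row ~= field;     // end of field and end of row
            rows ~= row;
            row = null;
            field = null;
            rowHasData = false;
        }
        else if (c != '\r')   // ignore carriage returns outside quotes
        {
            field ~= c;
            rowHasData = true;
        }
    }
    if (rowHasData)           // last row without a trailing newline
    {
        row ~= field;
        rows ~= row;
    }
    return rows;
}

void main()
{
    auto text = `foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar`;

    auto rows = parseCsv(text);
    assert(rows.length == 2);
    assert(rows[0] == ["foo", "bar"]);
    assert(rows[1][0] == "this item has\ntwo lines, a comma, and a \" mark!");
    assert(rows[1][1] == "this is just bar");
    writeln(rows);
}
```

Because the quote state is carried across newlines, there is no need to split into lines first and glue them back together later.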
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on: these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works on everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real-world experience than from taking a quick glance. Once you know them, though, writing the code again isn't terribly hard, because you know what to test for! But if you don't know them, you can write beautiful code with hundreds of unittests... and still miss the real-world case your user just happens to try that one time it matters.
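As a sketch of "knowing what to test for", the three corner cases named earlier (a quoted comma, a quoted newline, a doubled quote) can be written down as assertions against whichever reader you use. The example below checks std.csv; the helper name readAll is an assumption made for illustration, and the test inputs are my own, not taken from a real file.

```d
import std.csv : csvReader;

/// Hypothetical helper: collects every record std.csv produces
/// into an array of string fields.
string[][] readAll(string text)
{
    string[][] rows;
    foreach (record; csvReader!string(text))
    {
        string[] fields;
        foreach (cell; record)
            fields ~= cell;
        rows ~= fields;
    }
    return rows;
}

void main()
{
    // A comma inside quotes stays within one field.
    assert(readAll(`foo,bar,"hello, user",baz`)
        == [["foo", "bar", "hello, user", "baz"]]);

    // A newline inside quotes does not start a new row.
    assert(readAll("a,\"line one\nline two\",b")
        == [["a", "line one\nline two", "b"]]);

    // Two quotes in a row inside a quoted field are one literal quote.
    assert(readAll(`a,"say ""hi"" now",b`)
        == [["a", `say "hi" now`, "b"]]);
}
```

The same three assertions can be pointed at a regex-based splitter or a hand-written reader to see which cases it covers.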
