Code and data
I recently answered a Stack Overflow question which started off with a broken XPath expression by suggesting that the poster might be better off using LINQ to XML instead. The discussion which followed in the comments (around whether or not this was an appropriate answer) led me to think about the nature of code and data, and how important context is.
I don't think there's any particularly deep insight in this post - so I'll attempt to keep it relatively short. However, you might like to think about how code and data interact in your own experience, and what the effects of this can be.
Code is data
Okay, so let's start off with the obvious: all code is data, at some level. If it's compiled code, it's just binary data which a machine can execute. Put it on another machine with no VM, and there's nothing remarkable about it. It's just a load of 1s and 0s. As source code, most languages are just plain text. Open up some source code written in C#, Ruby, Python, Java, C++ etc in Notepad and it'll be readable. You may miss the syntax highlighting and so forth, but it's still just text.
Code in the right context is more than just data
So what makes this data different to (say) a CSV file or a plain text story? It's all in the context. When you load it into the right editor, or pass it to the right compiler, you get more information: in an editor you may see the aforementioned syntax highlighting, autocompletion, documentation for members you're using; a compiler will either produce errors or a binary file. For something like Python or Ruby, you may want to feed the source into an interpreter instead of a compiler, but the principle is the same: the data takes on more meaning.
Code in the wrong code-related context is just data again
Now let's think about typical places where you might put code (or something with similar characteristics) into the "wrong" context:
- SQL statements
- XSLT transformations
- XPath expressions
- XML or HTML text
- Regular expressions
All of these languages have editors which understand them, and will help you avoid problems. All of them can also be embedded in other code - C#, for example. Indeed, almost all the regular expressions I've personally written have ended up in Java or C# code. At that point, there are two problems:
- You may want to include text which doesn't embed easily within the "host" language's string literals (particularly double quotes, backslashes and newlines)
- The code editor doesn't understand the additional meaning to the text
The first problem is at least somewhat mitigated by C#'s support for verbatim string literals - only double quotes remain as a problem. But the second problem is the really big one. Visual Studio isn't going to check that your regular expression or XPath expression looks valid. It's not going to give you syntax highlighting for your SQL statement, much less IntelliSense on the columns present in your database. Admittedly such a thing might be possible, if the IDE looked ahead to find out where the text was going to be used - but I haven't seen any IDE that advanced yet. (The closest I've seen is ReSharper noticing when you're using a format string with the wrong number of parameters - that's primitive but still really useful.)
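As a rough illustration of both problems, here is a small sketch. The class and member names are made up for this example, and the patterns are arbitrary. It shows the same regular expression written as a regular and as a verbatim string literal, plus a broken pattern which the compiler accepts without complaint because, as far as C# is concerned, it's only a string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; none of these names come from a real library.
public static class EmbeddedRegexDemo
{
    // Regular string literal: every backslash has to be doubled.
    public const string Regular = "^\\d{3}-\\d{4}$";

    // Verbatim string literal: backslashes are taken literally.
    public const string Verbatim = @"^\d{3}-\d{4}$";

    // Double quotes are still awkward in a verbatim literal: they must be doubled.
    public const string Quoted = @"say ""hello""";

    // Missing the closing parenthesis, but this compiles: it's just text.
    public const string Broken = @"(\d{3}";

    // The mistake only shows up at execution time, when the pattern is parsed.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Calling EmbeddedRegexDemo.IsValidPattern(EmbeddedRegexDemo.Broken) reports the problem, but only once the code is actually running - which is rather later than I'd like.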
Of course, you could write your SQL (or XPath etc) in a dedicated editor, and then either copy and paste it into your code or embed it into your eventual binary and load it at execution time. Neither of these is particularly appealing. Copy and paste works well once, but then when you're reading or modifying the code you lose the advantages you had unless you copy and paste it again. Embedding the file can work well in some cases - I use it liberally for test data in unit tests, for example - but I wouldn't want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what's going to happen. In some cases that's not too bad - it's only like opening another class or method, I guess - but in other cases the shift of gears is too distracting.
When code is data, it's easy to mix it with other data - badly
Within C# code, it's easy to see the bits of data which sometimes occur in your code: string or numeric literals, typically. Maybe you subscribe to the "no magic values" philosophy, and only ever have literals (other than 0 or 1, typically) as values for constants. Well, that's just a level of indirection - which in some ways hides the fact that you've still got magic values. If you're only going to use a piece of data once, including it directly in-place actually adds to readability in my view. Anyway, without wishing to dive into that particular debate too deeply, the point is that the compiler (or whatever) will typically stop you from using that data as code - at least without being explicit about it. It will make sure that if you're using a value, it really is a value. If you're trying to use a variable, it had better be a variable. Putting a variable name in quotes means it's just text, and using a word without the quotes will make the compiler complain unless you happen to have a variable with the right name.
Now compare that with embedding XPath within C#, where you might have:
var node = doc.SelectSingleNode("//foo/bar[@baz=xyz]");
Now it may be obvious to you that "xyz" is meant to be a value here, not the name of an attribute, an element, a function or anything like that... but it's not obvious to Visual Studio, which won't give you any warnings. This is only a special case of the previous issue of invalid code, of course, but it does lead onto a related issue... SQL injection attacks.
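To make the difference concrete, here's a minimal sketch against a tiny sample document (the XML and the class name are invented for this example). Without quotes, XPath treats xyz as the name of a child element of bar, so the predicate compares the attribute against a node set which happens to be empty, and nothing matches. With quotes, it's the string value that was intended.

```csharp
using System.Xml;

// Hypothetical demo class with made-up sample data.
public static class XPathQuotingDemo
{
    private const string Xml =
        "<root><foo><bar baz='xyz'>target</bar></foo></root>";

    // Loads the sample document and evaluates the given XPath expression.
    // Returns null when no node matches.
    public static XmlNode Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectSingleNode(xpath);
    }
}
```

Here XPathQuotingDemo.Select("//foo/bar[@baz=xyz]") quietly returns null, whereas XPathQuotingDemo.Select("//foo/bar[@baz='xyz']") finds the element. Both expressions are perfectly valid XPath, which is exactly why nothing warns you.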
When you've already got your "code" as a simple text value - a string literal containing your SQL statement, as an obvious example - it's all too easy to start mixing that code/data with genuine data data: a value entered by a user, for example. Hey, let's just concatenate the two together. Or maybe use a format string, effectively mixing three languages (C#, SQL, the primitive string formatting "language" of string.Format) into a single statement. We all know the results, of course: nothing differentiates between the code/data and the genuine data, so if the user-entered value happens to look like SQL to drop a database table, we end up with Little Bobby Tables.
I'm sure 99% of my blog readers know the way to avoid SQL injection attacks: use parameterized SQL statements. Keep the data and the code separate, basically.
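For the remaining 1%, here's a sketch of the difference, written against the provider-neutral System.Data interfaces. The Users table, its columns and the class name are hypothetical, and the @name placeholder syntax is an assumption: it's what SQL Server uses, but other providers may expect a different marker.

```csharp
using System.Data;

// Hypothetical demo class; the table and column names are invented.
public static class UserQueries
{
    // Unsafe: the user's value becomes part of the SQL "code".
    public static string BuildUnsafeSql(string userName)
    {
        return "SELECT Id FROM Users WHERE Name = '" + userName + "'";
    }

    // Safer: the SQL text is fixed, and the value travels separately as a parameter.
    public static IDbCommand CreateFindUserCommand(IDbConnection connection, string userName)
    {
        IDbCommand command = connection.CreateCommand();
        command.CommandText = "SELECT Id FROM Users WHERE Name = @name";

        IDbDataParameter parameter = command.CreateParameter();
        parameter.ParameterName = "@name";
        parameter.DbType = DbType.String;
        parameter.Value = userName;
        command.Parameters.Add(parameter);

        return command;
    }
}
```

In the first method, a value such as Robert'); DROP TABLE Students;-- ends up inside the statement itself. In the second, the command text never changes, whatever the user types.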
Expressing the same ideas, but back in the "native" language
Going back to the start of all this, the above is why I like LINQ to XML. When I express a query using LINQ to XML, it's often a lot longer than it would have been in the equivalent XPath - but I can tell where the data goes. I know where I'm using an element name, where I'm using an attribute name, and where I'm comparing or extracting values. If I miss out some quotes, chances are pretty high that the resulting code will be invalid, and it'll be obvious where the problem is. I'm prepared to sacrifice brevity for the fact that I only work in a single language + library, instead of trying to embed one language within another.
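As a sketch of what I mean, here's roughly how the XPath expression shown earlier might look in LINQ to XML. The method and class names are mine, and I'm assuming the intent of the XPath was "the first bar element directly under any foo whose baz attribute has a given value".

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class; a rough LINQ to XML counterpart of //foo/bar[@baz='xyz'].
public static class LinqToXmlQueryDemo
{
    // Returns the first matching element, or null if there isn't one.
    public static XElement FindBar(XDocument doc, string bazValue)
    {
        return doc.Descendants("foo")   // element name
                  .Elements("bar")      // element name
                  .FirstOrDefault(bar =>
                      (string) bar.Attribute("baz") == bazValue); // attribute name, then a value
    }
}
```

It's longer, but the value being compared is an ordinary C# expression. If I'd forgotten the quotes around "xyz" at the call site, the compiler would have gone looking for a variable called xyz and complained when it didn't find one.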
Likewise building XML using LINQ to XML is much better than concatenating strings - I don't need to worry about any nasty escaping issues, for example. LINQ to XML has been so nicely designed, it makes all kinds of things incredibly easy.
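Here's a small sketch of the escaping point, using an invented book element and made-up method names. Both methods try to produce the same element; only one of them copes with a title containing quotes or ampersands.

```csharp
using System.Xml.Linq;

// Hypothetical demo class; the element and attribute names are invented.
public static class XmlBuildingDemo
{
    // Fragile: a title containing a double quote or an ampersand produces broken XML.
    public static string BuildByConcatenation(string title)
    {
        return "<book title=\"" + title + "\" />";
    }

    // LINQ to XML escapes the attribute value when the element is written out.
    public static string BuildWithLinqToXml(string title)
    {
        return new XElement("book", new XAttribute("title", title)).ToString();
    }
}
```

With a title such as Q&A "explained", the text from the second method can be parsed again and gives back the original title, whereas the text from the first isn't well-formed XML at all.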
Regular expressions can sometimes be replaced by simple string operations. Where they can, I will often do so. I'd rather use a few IndexOf and Substring calls over a regular expression in general - but where the patterns I need get too tricky, I will currently fall back to regular expressions. I'm aware of ReadableRex but I haven't looked at it in enough detail to say whether it can take the place of "normal" regular expressions in the way that LINQ to XML can so often take the place of XPath.
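For a sense of the trade-off, here's a deliberately trivial sketch: pulling out everything after the first '=' in a string, once with IndexOf and Substring and once with a regular expression. The names are invented for this example, and the two versions are only meant to agree on simple single-line input.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class comparing the two approaches.
public static class KeyValueDemo
{
    // Plain string operations: returns the text after the first '=', or null if there is none.
    public static string ValueWithStringOps(string text)
    {
        int index = text.IndexOf('=');
        return index == -1 ? null : text.Substring(index + 1);
    }

    // The same idea as a regular expression embedded in a string literal.
    public static string ValueWithRegex(string text)
    {
        Match match = Regex.Match(text, "^[^=]*=(.*)$");
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The first version is a little longer, but every step is C# that the compiler and the IDE understand.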
Of course, LINQ to SQL (and the Entity Framework) do something similar for SQL... although that's slightly different, and has its own issues around predictability.
In all of these cases, however, the point is that by falling back to more verbose but more native-feeling code, some of the issues of embedding one language within another are removed. Code is still code, data is data again, and the two don't get mixed up with each other.
Conclusion
If I ever manage to organize these thoughts in a more lucid way, I will probably just rewrite them as another (shorter) post. In the meantime, I'd urge you to think about where your code and data get uncomfortably close.