First, let me talk about system boundaries, or maybe even language boundaries. To me, a boundary is where communication takes place. Now, the important part is not the communication itself, but the fact that communication always takes place over a (pre-)defined protocol, which has its own syntax and semantics.
For instance, in order to read this, your browser had to send a message to the webserver, which included the URL of this page. This message was in the HTTP protocol.
Upon receiving this message, the HTTP server parsed this message. The URL consists of parts, which again are parsed out. One of the parts contains the information needed to get this posting from the database.
This part had to be inserted into a string (SQL query), and this query was sent to the database server. The database server parsed the query, and got a post using that. The contents of that post was then sent back to the HTTP server.
The HTTP server then parsed the LaTeX I originally wrote, converted it into HTML, and finally sent it back to your web browser, which had to parse the HTML and render it on your screen.
What should be obvious is that there is a lot of conversion going on. If all of the world was on one single computer, this might not be necessary. But the computers/servers/... need to communicate with each other in order to give you what you desire most: a rendering of this webpage that you can read.
Where this communication happens, I'd like to think of as a boundary. Two systems want to talk to each other.
Now, you want this communication to work in such a way that either side of it can be replaced independently. It doesn't matter if you visit this site with Google Chrome, Firefox, Internet Explorer, or Lynx: the same data-communication will take place. Also, your web browser is just as happy to talk to my web server to show you this post as it is to talk to any other web server.
The reason this works is that the way they talk with each other is standardized. In other cases, the communication is not as explicitly standardized. But the fact remains: both sides of the communication need to agree on how they communicate, or the communication will not go as expected.
We call such an agreed upon method of communication a protocol. A protocol is needed to bridge the boundary between systems, to make sure both sides of the communication understand each other.
The following example might be a bit boring perhaps, but please bear with me. With this I hope to make you understand what it means to an attacker to 'inject' arbitrary data. For this, I'll use the very simple example of calculation. And, I'll need to describe three parties.
The service (Simon). The service is quite a simple one. You give him a number, and he gives you back double that number. You send him the message 3, he sends you the message 6 back. Simple, right?
The user (Ulrich). Somebody who has a number that needs to be doubled. For instance, the 3.
The calculator (Carl). Somebody who is willing to perform whatever calculation you give him. If you send him the message 2 + 9 * 3 - 12 / 3, he returns to you the message 25.
Now, first of all, Simon is quite lazy. He doesn't like performing calculations, so he has found a way around that. Whenever he receives a message from Ulrich, he does the following:
First, he gets a new piece of paper.
On that piece of paper, he writes 2 *
Then, he writes down the message Ulrich sent him.
He then sends the new paper to Carl.
Finally, he waits for Carl to send him the result, and forwards it back to Ulrich.
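The steps above can be sketched in a few lines of Python. The names and the use of eval are my own illustration, not anything from an actual system: Carl blindly evaluates whatever he is sent, and Simon just glues "2 * " in front of Ulrich's message.

```python
def carl(message: str) -> int:
    """Carl: performs whatever calculation the message describes."""
    return eval(message)  # Carl parses and evaluates the message as-is

def simon(message: str) -> int:
    """Simon: 'doubles a number' by pasting it into a bigger message."""
    paper = "2 * " + message  # write down "2 *", then copy Ulrich's message
    return carl(paper)        # send the paper to Carl, forward his answer

print(simon("3"))   # Ulrich sends 3, gets 6 back
print(simon("13"))  # Ulrich sends 13, gets 26 back
```

Note that Simon never interprets the message himself; he only performs string concatenation. That detail is the seed of everything that follows.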
For a few weeks, all is going fine. Ulrich sends Simon the message 4, and eventually gets back the message 8. He sends 13, and receives 26.
Then, one day, Ulrich needs a bit more difficult calculation. He first needs a number multiplied by 3, before it has to be multiplied by 2. So, he sends the message 5 * 3 to Carl, gets 15 in return, forwards that to Simon, and eventually gets the result 30.
After doing that a few times, he is losing a lot of time on the calculations. So, just as a test, he sends the message 3 * 5 to Simon, and gets 30 in return. Conclusion: Simon is nice enough to first calculate the result of whatever you send him, before he doubles it.
Again, everything is going fine for a while. Then, Ulrich sends the message 1 + 2 to Simon, who writes down 2 * 1 + 2, and Carl calculates 4. But 1 + 2 is supposed to be 3, right? Which would bring the end-result to 6, wouldn't it?
Well, not quite. Because Simon wasn't really offering to perform arbitrary calculations, he was just offering to double a number. We were actually abusing him by asking him to do stuff he didn't really want to do. And we got bitten in the end, because we got the wrong answer.
Nothing went really wrong for Simon and Carl though, in this case, except that Carl had to do a bit more work than Simon thought he was asking of Carl. No real harm.
Now, imagine me walking up to the bank and being allowed to ask for the balance on my account: 'Hello sir, may I have the balance of the account owned by Sjoerd Job?' The employee checks my ID, enters my name into his computer, and waits for the results. Unbeknownst to the employee, the request gets translated into Give balance for account "Sjoerd Job". All is fine.
Then, I make a fake ID, and on that ID I write the name Sjoerd Job" and increment balance by 5000 for account "Sjoerd Job. The employee again enters that, but now the system suddenly increments my bank balance by 5000 units (be it Euros, Dollars, I don't really care for now).
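In real SQL this is the classic injection pattern. A minimal sketch with Python's sqlite3 module (table and names invented for illustration) shows the difference between pasting the name into the query text and sending it as a parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('Sjoerd Job', 100)")

# A 'funny' name on a fake ID:
name = "nobody' OR 'x' = 'x"

# Vulnerable: the name is pasted straight into the SQL text, so the
# quote in the name terminates the string literal and the rest
# becomes part of the query itself.
query = "SELECT balance FROM accounts WHERE owner = '%s'" % name
print(conn.execute(query).fetchone())  # (100,) -- matched despite the fake name

# Safe: a parameterized query ships the name as data, never as SQL.
row = conn.execute(
    "SELECT balance FROM accounts WHERE owner = ?", (name,)
).fetchone()
print(row)  # None -- no account is actually owned by that string
```

The parameterized form is the database driver doing the escaping for you, at exactly the boundary where the query crosses to the database.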
Now, in real life this would probably not happen, because the bank employee would see the name on my ID, find it funny, and shove me out of the door. Computers, on the other hand, are not that smart, and depend on the software developer teaching them to recognize 'funny' stuff, or even deal with it gracefully. Not all programmers do this correctly.
Validation is the act of looking at a piece of data, and checking that it is within bounds of what you expect. Like the bank employee checking to see that there are no double-quotes (") in the account name. Or like Simon checking that the message he gets sent is actually purely numeric (and does not contain any operations like addition, multiplication or whatnot).
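Simon's check can be sketched as a tiny validation function (my own illustration, with a hypothetical name). It accepts a purely numeric message and rejects everything else before any work is started:

```python
def validate_number(message: str) -> int:
    """Accept only a plain non-negative integer, reject everything else."""
    if not message.isdigit():  # rejects '3 * 5', '1 + 2', 'abc', ...
        raise ValueError("expected a plain number, got %r" % message)
    return int(message)

print(validate_number("13"))  # 13: fine, Simon can proceed

try:
    validate_number("1 + 2")  # contains an operation, not a number
except ValueError as exc:
    print(exc)
```

With this check in place, Ulrich's clever 3 * 5 trick from earlier would have been rejected up front instead of silently producing a surprising answer.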
In most cases, validation is best done right where the input is received, before any actions are started. Simon would be better off checking Ulrich's message first, rather than copying it along until the point where he notices something is wrong.
On the other hand, this comes with a cost in performance. Now Simon needs to read the message twice: first to check that it's correct, then to copy it onto the piece of paper with the 2 * in front. But this cost is mostly negligible.
Validation is necessary, and it should be done as early as possible. This is better for the end-user: better to tell them early on that they entered incorrect data than to let calculations and operations get halfway before the error shows up. Especially for long-running operations this can be quite frustrating.
There is one exception, however, where I also want validation to happen as late as possible: when an operation is security-sensitive and consumes user input. If no validation were done on a higher layer, you would have a vulnerability, so I want the data validation to also happen as close to the operation as possible. An extra trip-wire, so to speak.
Escaping is the act of re-formatting a piece of data in such a way that the parser on the other side will understand that the data was an argument to the message, and not a new command. For instance, in HTML, tags are opened with the < token and closed with the > token. User input can contain these, but we don't want the user to control the tags in the rendered result, because that could have dangerous consequences. To solve that, we need to re-format the message in such a way that these tokens aren't included in the end-result. For HTML, we can replace < with &lt; (less-than).
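Python's standard library already ships this particular replacement as html.escape, which rewrites the HTML-significant tokens so the browser's parser treats the input as text rather than markup. A small sketch:

```python
import html

# Hypothetical user input that tries to smuggle in a tag:
user_input = '<script>alert("hi")</script>'

# html.escape replaces &, <, > (and, by default, quotes) with entities,
# so the parser on the other end sees data, not a new command.
print(html.escape(user_input))
# &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

The browser renders the escaped form back as the literal text the user typed, which is exactly the round-trip we want across the boundary.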
Escaping is also an action that should happen at a boundary. However, this time it should not happen when receiving data, but when preparing data to be sent outwards.
Some people like to escape data right away, as soon as they receive it. The underlying thought is that if they have escaped it once, they will not have to escape it again. However, I believe this preference is misguided, or better said: blatantly wrong.
Escaping has to be done in the context of the output protocol/language. Escaping a string for use in an SQL statement will not help you when you are going to use it in an HTML document, nor vice-versa.
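A quick sketch makes the point concrete. The same string needs completely different treatment for an HTML document than for an SQL string literal (where doubling the single quote is the literal-syntax convention; in real code you would use parameterized queries instead, as discussed above):

```python
import html

s = "O'Reilly & <Sons>"

# HTML context: &, <, > and quotes are the dangerous tokens.
print(html.escape(s))
# O&#x27;Reilly &amp; &lt;Sons&gt;

# SQL string-literal context: only the single quote matters,
# and it is escaped by doubling it, not with entities.
print(s.replace("'", "''"))
# O''Reilly & <Sons>
```

Neither output is usable in the other context, which is why escaping "once, up front" cannot work: you don't yet know which boundary the data will cross.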
Escaping modifies the user input. In almost all cases, you will want to know what the user originally entered; by 'escaping' the data up front, you lose track of that. Especially when you store escaped and un-escaped data side by side, it can get quite confusing. By escaping as late as possible, you make sure that the parser on the other end ends up with an exact copy of the user input.
Basically my rules of thumb boil down to the following:
Escape data when sending data over a boundary.
Validate data when receiving data over a boundary.
Validate data when sending data over a boundary.
Note that I said 'data' here, not 'user input'. Most of the time it's simpler not to care whether data to be sent over a boundary is actually user input or not, and in most cases you still want to escape the data even when it was not user input.