Roy Tang

Programmer, engineer, scientist, critic, gamer, dreamer, and kid-at-heart.

Blog Notes Photos Links Archives About

Learning to SQL

****Since I wasn’t a graduate of computer science, there were many concepts of software development I really only got exposed to when I started working. One of those was the concept of a relational database, and hence SQL. The company I worked at gave all new hires a training regimen that started with about a week of SQL. Despite not knowing anything about it beforehand, I took to it like a mouse takes to cheese. At the end of the week, they give you an exam to see how well you understood the material, and my batchmate who I took the course with likes to tell people that halfway through the allocated time for the exam, I was already taking a nap.

The fact that in my first project I had to port a lot of reports meant I was able to apply what I learned and learn even more easily. I would find as time went on that these bespoke projects typically included hundreds of different reports. The users would always want to see so-and-so data over time or compared with other sets of data, or summary reports of data over weeks, months, years, and so on. And when these systems are being developed, the reports are typically among the last to be finalized, which means that the database structure is almost always optimized towards the needs of the functions needed to maintain the data, and not towards reporting on that data. This typically led to reports that had really long and complicated SQL statements that would look intimidating to newbies. I didn’t mind, I treated each large SQL as a puzzle to be solved and understood.

About a year and a half later, I got tapped to be a trainer in the company for the first time, and I would be training people on SQL. Training other people gave me a better appreciation of how hard SQL was for other people. When I was in college, I understood that certain subjects could be considered as “filters” that many people will not pass and should take as a sign that the course of study was not for them. An example was the Introduction to Computer Science class that most engineering students have to take. Programming is hard and needs a certain mindset, so many people would fail or do poorly in that course, and if you were a Computer Science major it was a clear sign as to whether you should continue that course or not. SQL had that same property – I found out that it was a pretty good indicator of whether certain developers would do well in the company or not.

As I mentioned, the company’s SQL training course was a week long. You could typically tell which trainees would be having trouble by the third day – that was the day that joins would start to be discussed. Joins where were the men got separated from the boys as it were. SQL joins could not be treated the same way as other programming problems, they were not best solved in an iterative or procedural matter, but rather should be approached by an understanding of set theory. You had to figure out the particular set of results you wanted to retrieve and what set operations needed to be done on the data in order to get those results. Most of the trainees having trouble with SQL would find it difficult to wrap their heads around this concept.

Our SQL training was decidedly Oracle-specific, instead of adhering strictly to the ANSI syntax. We preferred teaching Oracle-specific syntax: DECODE instead of CASE, joining by saying “from TABLE1, TABLE2” instead of the more modern “from TABLE1 join TABLE2 on … “. I am not even sure if at that time Oracle already supported the more modern/standardized syntax.

Another thing about the SQL course was that we had to teach SQL performance tuning. It was a subject matter that was hard to teach because providing examples was so difficult. SQL that has a performance problem is typically long, complicated, and involves a number of tables that would require a certain level of domain knowledge. So we would end up providing simple examples and a list of tips for writing SQL (avoid sorts, full scans, distincts, etc) and we had to hope that the trainees would remember these tips when they start writing their own SQL for their projects. Even after a decade later, I would find myself glaring at developers who never learned to avoid using UNION instead of UNION ALL where possible.

For projects with a lot of reports, SQL performance was a big deal, especially when reports were expected to be generated interactively by the user online (as opposed to in the background by batch)

SQL performance tuning involved a relatively straightforward procedure: review the execution plan for the SQL (can be extracted from the database), and try to minimize the number of full table scans and unnecessary sorts. Oracle execution plans typically will even give you a cost for each operation, making it easy to identify which operations are more important to optimize. You also needed some idea of the volume of data on each table. It was more important to minimize full scans and sorts for larger data sets.

To remove full table scans, you typically had to introduce new indexes. You had to identify which WHERE conditions could benefit the most from an index. Some WHERE conditions could not use an index, and you would need to figure out if there was a way to rewrite the conditions that would allow an index to be used. Minimizing sorts means removing unnecessary set operations that require sorting (typically UNION, INTERSECT, MINUS, DISTINCT, that kind of thing). Sometimes the order in which tables in the SQL were evaluated would also matter, so you had to rewrite the SQL such that the faster operations would be done first in order to reduce the data volume before the more expensive operations come in.

Sometimes all of these tips and tricks wouldn’t work and you would spend hours on end trying to figure out how to get an SQL performing better. I once spent more than a day in my first project trying to figure out how to tune a report that was taking more than one day to execute. The usual tricks weren’t working, but I eventually figured out that it would be faster to separate the data into two subsets and perform different filtering conditions on each set, then combine using a UNION ALL. (In Oracle a UNION ALL performed no sorting, so it was relatively safe). It was the first real SQL tuning I was proud of, my team leader was amazed when the succeeding runs of the report took a matter of seconds.

Posted by under post at #Software Development
Also on: twitter / 0 / 1137 words

See Also