Thursday, November 12, 2009

Subqueries Primer: Find Employees With Salary Higher Than Their Department Average

This is a classic SQL problem- probably more a textbook exercise than a real-life business problem, but a classic nonetheless. "Find all employees whose salary is higher than the average salary for their department."

In Access, the most straightforward approach is a multi-query solution. However it can be also be accomplished in a single query using subqueries. I'll discuss both approahces here

Assume our data in tblEmployeeInfo is like this:

EmployeeID DepartmentID Salary
1 1 $30,000.00
2 1 $25,000.00
3 1 $40,000.00
4 1 $33,000.00
5 2 $25,000.00
6 2 $50,000.00

Two Query Solution

First create a query that finds average salary by department. I'll call this qryAvgSalaryByDepartment. This can be easily created in the query grid:

Or here's the SQL:

SELECT 
  DepartmentID, 
  Avg(tblEmployeeInfo.Salary) AS DeptAvgSalary
FROM tblEmployeeInfo
GROUP BY DepartmentID;

When we run the query we see the following results:

The second query simply joins the original salary data to this totals query (using DepartmentID as the common field) and uses a criteria to show only employees above the average salary for their department:

or
SELECT EmployeeID, tblEmployeeInfo.DepartmentID, Salary
FROM tblEmployeeInfo
  INNER JOIN qryAvgSalaryByDepartment
  ON tblEmployeeInfo.DepartmentID = qryAvgSalaryByDepartment.DepartmentID
WHERE Salary > DeptAvgSalary;
and will yield the final results we were looking for

Subquery Solution

You can use a query with a subquery to do the two steps of the two query solution in one query. First he's the SQL:

SELECT EmployeeID, DepartmentID, Salary
FROM tblEmployeeInfo AS T1
  WHERE Salary > 
  ( SELECT Avg(Salary) 
    FROM tblEmployeeInfo as T2
    WHERE T1.DepartmentID=T2.DepartmentID )
;
When we run this query we get the same results as the two query solution. Let's look at it in detail to understand how it works.

Inside the brackets you have a complete query- that's the subquery. It looks a lot like the first query we created. That query had a GROUP BY clause that told it to give us the average for each department. This query has a WHERE clause that tells it to give us the average for only one department at a time (more about how it knows which department to give us in a minute. The subquery's result is a single value- the average salary. That value can then be used in the main query. (Not all subqueries return a single value, some return a set of values.)

How does the subquery know which department to reurn an average salary for? It's WHERE clause tells it to select only records from the department of the employee being processed in the main query.WHERE T1.DepartmentID=T2.DepartmentID The main query will process each record in tblEmployeeInfo. And for each record it will run the subquery to get an average salary for their department. The WHERE clause tells the query to use a claue from the main query as the criteria in the subquery.

T1 and T2 are aliases we give the tables being referred to in the main query and the subquery. The reason we need aliases is that they're the same table- tblEmployeeInfo. By using the AS keyword we can specify a name we want to use to refer to the two different instances of the table in out query. I just used T1 and T2 for table 1 and table 2. You could be more imaginative. If the subquery doesn't use the same table as the main query you don't need an alias. In SQL texbooks these aliases are sometimes called carelation IDs when they're used to match (or corelate) records in the main and subquery.

And just because in SQL there's always another way to do it, here's another subquery approach which uses a JOIN instead of a filter based on a filed in the outer query. In my mind this is less "elegant". But I'm not sure I have a specific reason for this. I haven't researched it, but I have a hunch that the solution using JOIN is not supported in all SQL implementations. Here's the SQL:

SELECT T1.EmployeeID, T1.DepartmentID, T1.Salary
FROM tblEmployeeInfo AS T1
  INNER JOIN
  ( SELECT DepartmentID, Avg(Salary) AS AvgSalary
    FROM tblEmployeeInfo 
    GROUP BY DepartmentID ) AS T2   
  ON   T1.DepartmentID = T2.DepartmentID
WHERE Salary > AvgSalary
;

See also: all my posts about subqueries.

No comments: