Wednesday, November 2, 2011

Notes on GROUP BY in MySQL

Here is a query I wanted to run, but I was concerned that the value of fat.rowsprocessed would not come from the same fat row as min(fat.processeddate).

select m.domain, ma.merchantacctid, ma.createddate, 
fat.rowsprocessed, min(fat.processeddate)
from merchant_account ma
join merchant m on m.merchantacctid = ma.merchantacctid
join ftp_audit_trail fat on fat.merchantacctid = 
    ma.merchantacctid
where fat.processeddate > ma.createddate 
and fat.rowsprocessed > 0
and ma.createddate > '2009-12-31'
group by fat.merchantacctid
order by domain;

My buddy Spencer pointed out that in standard SQL, if you use an aggregate function, then you have to include all the other fields you are selecting in the group by. It turns out that there is an extension to GROUP BY, and to HAVING, in MySQL that enables you to use them on a single field:

MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
11.15.3. GROUP BY and HAVING with Hidden Columns

I was afraid that the db would pick any value of rowsprocessed, that it would not come from the same row that the min(processeddate) is from.

select m.domain, ma.merchantacctid, ma.createddate, 
fat.rowsprocessed, fat.processeddate
from merchant_account ma
join merchant m on m.merchantacctid = ma.merchantacctid
join ftp_audit_trail fat on fat.merchantacctid = 
    ma.merchantacctid
where fat.processeddate > ma.createddate 
and fat.rowsprocessed > 0
and ma.createddate > '2009-12-31'
group by fat.merchantacctid
having min(fat.processeddate)
order by domain;

HAVING is what I wanted to use. It's also not standard SQL legal but the same MySQL extension enables this.

I ran both, exported the csv's, and diff'd them, and they gave identical results. But I suspect that was luck in this case, and that the first query would not be dependably unarbitrary. I am more comfortable with the second query, using HAVING.
 

1 comment:

Spencer Kormos said...

I probably would've came to the same conclusion
if (exists (select 1 from spencer having min(freetime)>0))