I am looking for a way to run frequencies for a number of different columns. I'm used to working in SAS and SQL is still new. In SAS I'd run a proc freq; table var1 var2 var3; run;
command; however, it appears as though this requires a bit more code in SQL. Because of the number of variables/columns I'm obtaining frequencies for, I really don't want to have to write out the same code over and over again. Eg:
select
language,
count(*) as freq
from #pop
group by language
order by language
I've been looking at cursor commands to loop through column names, so I start with generating a new temporary table with all the column names that I'd like to loop through. Here's what I've been able to put together:
drop table if exists #dem_table;
create table #dem_table (dem_vars varchar(50) not null);
insert into #dem_table values ('age_cat'),('race'),('ethnicity'),('language'),('sex'),('orientation'),('income'),('urban'),('poverty'),('children'),('marital_status'),('state');
declare @column varchar(50);
declare cursor_dem cursor for
select * from #dem_table;
open cursor_dem;
fetch next from cursor_dem into @column;
while @@fetch_status = 0
begin
select @column, count(*) as freq
from #pop
group by @column
order by @column;
fetch next from cursor_dem into @column;
end;
close cursor_dem;
deallocate cursor_dem;
It works until I reach the "group by" and "order by" syntax, and then I receive the following errors:
Each GROUP BY expression must contain at least one column that is not an outer reference. The SELECT item identified by the ORDER BY number 1 contains a variable as part of the expression identifying a column position. Variables are only allowed when ordering by an expression referencing a column name.
Wondering if dynamic sql might be helpful? But I can't get it to iterate through columns that way either.
This could probably be pared down a bit, or adjusted to accept a specific list of columns if you are interested only in a subset (For instance, you probably don't care about the distribution of a surrogate key, so you could exclude that. Or maybe you want to exclude
float
columns), but here is a snippet I use when I want to find the distinct values in a table, and the counts thereof.I find this approach the easiest to reason about, and since I'm typically running this in an ad-hoc fashion, I usually don't care that I have to re-hit the table n times for each column.
EDIT If you really want as many result sets as you have columns to analyze, you could replace the insert into
#DistVals
with a simple select. I don't know your exact use case, but to me, having a single dataset containing all columns is more useful.That said, if you want to get everything in one go, You could also do that by doing a dynamic
unpivot
.