Aggregation
count
count(*)
to get the row number, or count(col)
to get the number of rows when col
is not NULL
count_distinct
count_distinct(col)
to get the number of unique values for the col
column. Same as count(distinct col)
count_if
count_if(condition)
to apply a filter with condition
and get the number of records. e.g. count_if(speed_kmh>80)
distinct
distinct(col)
to get the distinct value for the col
column.
unique
unique(<column_name1>[, <column_name2>, ...])
: Calculates the approximate number of different values of the columns.
unique_exact
unique_exact(<column_name1>[, <column_name2>, ...])
Calculates the exact number of different values of the columns.
unique_exact_if
unique_exact_if(col,condition)
to apply a filter with condition
and get the distinct count of col
, e.g. to get the cars with high speed unique_exact_if(cid,speed_kmh>80)
min
min(<column_name>)
: minimum value of a column. For String column, the comparison is lexicographic order.
max
max(<column_name>)
: maximum value of a column. For String column, the comparison is lexicographic order.
sum
sum(<column_name>)
: sum of the columns. Only works for numerics.
avg
avg(<column_name>)
: average value of a column (sum(column) / count(column)). Only works for numeric columns.
median
median(<column_name>)
Calculate median of a numeric data sample.
quantile
quantile(column,level)
Calculate an approximate quantile of a numeric data sequence. e.g. quantile(a,0.9)
to get the P90 for the column and quantile(a,0.5)
to get the median number