Friday, September 17, 2010

Analytic function in postgresql

Few function in oracle personally i like much more because of simplicity to achieve what we want, one of those function is analytic/windows functions. In postgresql it works 8.4 and later version only.

Key points:

       Analytic functions give aggregate result they do not group the result set.
They return the group value multiple times with each record. As such any
other non-"group by" column or expression can be present in the select clause.
Analytic functions are computed after all joins, WHERE clause,
GROUP BY and HAVING are computed on the query. The main ORDER BY clause of the query
operates after the analytic functions, So analytic functions can only appear in the
select list and in the main ORDER BY clause of the query.


The general syntax of analytic function is:

function_name ([expression [, expression ... ]]) OVER ( window_definition )

where window_definition has the syntax

[ existing_window_name ]

[ PARTITION BY expression [, ...] ]

[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]

Let see simple example for each function

1 avg(salary) over (partition by dept)


avg() return (average salary of the partition and) group value multiple time for each record with average salary for partition.

SELECT dept,salary, avg(salary) over (partition by dept) FROM test ORDER BY dept,salary DESC;

dept | salary | avg

------------+--------+-----------------------

accounting | 3700 | 2508.3333333333333333

accounting | 2950 | 2508.3333333333333333

accounting | 2950 | 2508.3333333333333333

accounting | 2100 | 2508.3333333333333333

accounting | 1700 | 2508.3333333333333333

accounting | 1650 | 2508.3333333333333333

it | 3700 | 2700.0000000000000000

it | 3250 | 2700.0000000000000000

it | 2600 | 2700.0000000000000000

it | 1250 | 2700.0000000000000000

management | 3250 | 2962.5000000000000000

management | 3250 | 2962.5000000000000000

management | 2950 | 2962.5000000000000000

management | 2400 | 2962.5000000000000000

marketing | 1650 | 1325.0000000000000000

marketing | 1000 | 1325.0000000000000000

(16 rows)


2) rank() over(partition by dept order by salary desc)


rank() is kind-of like rownum. It returns which position given row gets when ordering by (whatever we put in “order by” clause inside “over ()”. ) . rank() is not really rownum – it’s not unique.
In case of a tie of 2 records at position N, RANK declares 2 positions N and skips position N+1 and gives position N+2 to the next record. While DENSE_RANK declares 2 positions N but does not skip position N+1.

don=# SELECT person,dept,salary, rank() over(partition by dept order by salary desc), dense_rank() over(partition by dept order by salary desc) FROM test ORDER BY dept,salary DESC;

person | dept | salary | rank | dense_rank

-----------+------------+--------+------+------------

phoebe | accounting | 3700 | 1 | 1

lincoln | accounting | 2950 | 2 | 2

rachel | accounting | 2950 | 2 | 2

benjamin | accounting | 2100 | 4 | 3

michael | accounting | 1700 | 5 | 4

paul | accounting | 1650 | 6 | 5

alexander | it | 3700 | 1 | 1

lj | it | 3250 | 2 | 2

sara | it | 2600 | 3 | 3

ross | it | 1250 | 4 | 4

theodore | management | 3250 | 1 | 1

joey | management | 3250 | 1 | 1

monica | management | 2950 | 3 | 2

chandler | management | 2400 | 4 | 3

fernando | marketing | 1650 | 1 | 1

bradley | marketing | 1000 | 2 | 2

(16 rows)


3) row_number() over(partition by deptno order by hiredate desc)

ROW_NUMBER( ) gives a running serial number to a partition of records. It is very useful in reporting, especially in places where different partitions have their own serial numbers.

don=# select dept,salary,row_number() over ( partition by dept order by salary desc) from test;

dept | salary | row_number

------------+--------+------------

accounting | 3700 | 1

accounting | 2950 | 2

accounting | 2950 | 3

accounting | 2100 | 4

accounting | 1700 | 5

accounting | 1650 | 6

it | 3700 | 1

it | 3250 | 2

it | 2600 | 3

it | 1250 | 4

management | 3250 | 1

management | 3250 | 2

management | 2950 | 3

management | 2400 | 4

marketing | 1650 | 1

marketing | 1000 | 2

(16 rows)4)

4) count(salary) over( partition by dept)


same like avg(), but it return count value for each partition

don=# select dept,salary,count(*) over ( partition by dept order by salary desc) from test;

dept | salary | count

------------+--------+-------

accounting | 3700 | 6

accounting | 2950 | 6

accounting | 2950 | 6

accounting | 2100 | 6

accounting | 1700 | 6

accounting | 1650 | 6

it | 3700 | 4

it | 3250 | 4

it | 2600 | 4

it | 1250 | 4

management | 3250 | 4

management | 3250 | 4

management | 2950 | 4

management | 2400 | 4

marketing | 1650 | 2

marketing | 1000 | 2

(16 rows)

5)
LEAD and LAG

LEAD has the ability to compute an expression on the next rows (rows which are going to come after the current row) and return the value to the current row. The general syntax of LEAD is shown below:

LEAD (, , ) OVER ()

is the expression to compute from the leading row.
is the index of the leading row relative to the current row.
is a positive integer with default 1.
is the value to return if the points to a row outside the partition range.

SELECT deptno, empno, sal,

LEAD(sal, 1, 0) OVER (PARTITION BY dept ORDER BY sal DESC NULLS LAST) NEXT_LOWER_SAL,

LAG(sal, 1, 0) OVER (PARTITION BY dept ORDER BY sal DESC NULLS LAST) PREV_HIGHER_SAL

FROM emp WHERE deptno IN (10, 20) ORDER BY deptno, sal DESC;



DEPTNO EMPNO SAL NEXT_LOWER_SAL PREV_HIGHER_SAL

------- ------ ----- -------------- ---------------

10 7839 5000 2450 0

10 7782 2450 1300 5000

10 7934 1300 0 2450

20 7788 3000 3000 0

20 7902 3000 2975 3000

20 7566 2975 1100 3000

20 7876 1100 800 2975

20 7369 800 0 1100



8 rows selected.




full query


SELECT person, dept, salary, avg(salary) over (partition by dept), rank() over(partition by dept order by salary desc),row_number() over(partition by dept order by salary desc) , count(salary) over( partition by dept ) FROM test ORDER BY dept,salary DESC ;



No comments: