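For a Hive developer, knowing which SQL technique fits which situation is essential. This article works through three exercises, moving from basic aggregation to window functions, and uses concrete examples to build up practical skill with Hive queries step by step.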
To get the total number of users for each day, we can run the following Hive query:
SELECT logday, COUNT(DISTINCT userid) AS day_total
FROM test_window
GROUP BY logday;
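As a quick check of the daily count, the sketch below runs the same query on a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for Hive, purely so the example is self-contained. The sample rows and the helper name daily_user_totals are hypothetical, and the column layout of test_window (logday, userid, score) is inferred from the queries in this article.

```python
import sqlite3

# Hypothetical sample rows for test_window(logday, userid, score).
ROWS = [
    ("20191020", "11111", 85),
    ("20191020", "22222", 83),
    ("20191020", "11111", 90),  # same user appears twice on the same day
    ("20191021", "11111", 87),
    ("20191021", "33333", 99),
    ("20191022", "22222", 81),
]


def daily_user_totals(rows):
    """Return (logday, distinct user count) pairs, ordered by day."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute(
        "CREATE TABLE test_window (logday TEXT, userid TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO test_window VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT logday, COUNT(DISTINCT userid) AS day_total "
        "FROM test_window GROUP BY logday ORDER BY logday"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for logday, day_total in daily_user_totals(ROWS):
        print(logday, day_total)
```

COUNT(DISTINCT userid) is what keeps a user who appears twice on one day from being counted twice; a plain COUNT(*) would count rows instead.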
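Next, we need a running count, from the first day onward, of the user records with a score above 80. The exercise gives no cutoff date, so the window frame runs from the first row up to the current row. The query is: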
SELECT logday, userid, score,
       COUNT(*) OVER (ORDER BY logday
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
FROM test_window
WHERE score > 80;

To count, for each user, the days on which the score exceeded 80, partition the window by userid:
SELECT logday, userid, score,
       COUNT(*) OVER (PARTITION BY userid ORDER BY logday
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_days
FROM test_window
WHERE score > 80;

The second exercise uses the business table. To list the customers who made a purchase in April 2017 together with the number of such customers, group by name and count the grouped rows over an empty window:

SELECT name, COUNT(*) OVER () AS month_users
FROM business
WHERE SUBSTRING(orderdate, 1, 7) = '2017-04'
GROUP BY name;
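Going back to the two running counts over test_window discussed above (the overall one and the per-user one), here is a minimal sketch of how the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame behaves. It again uses Python's sqlite3 module in place of Hive, which needs a bundled SQLite of 3.25 or later for window functions. The sample rows and the name running_counts are hypothetical, and userid is added to the first ORDER BY only as a tie-breaker so the output is deterministic.

```python
import sqlite3

# Hypothetical sample rows for test_window(logday, userid, score).
ROWS = [
    ("20191020", "11111", 85),
    ("20191020", "22222", 83),
    ("20191020", "33333", 86),
    ("20191021", "11111", 87),
    ("20191021", "22222", 65),
    ("20191021", "33333", 98),
    ("20191022", "11111", 67),
    ("20191022", "22222", 34),
    ("20191022", "33333", 88),
    ("20191023", "11111", 99),
    ("20191023", "22222", 33),
]

RUNNING_COUNTS_SQL = """
SELECT logday, userid,
       -- running count over all qualifying rows, first row to current row
       COUNT(*) OVER (ORDER BY logday, userid
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total,
       -- running count restarted for each user
       COUNT(*) OVER (PARTITION BY userid ORDER BY logday
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_days
FROM test_window
WHERE score > 80
ORDER BY logday, userid
"""


def running_counts(rows):
    """Return (logday, userid, total, total_days) for rows with score > 80."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute(
        "CREATE TABLE test_window (logday TEXT, userid TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO test_window VALUES (?, ?, ?)", rows)
    result = conn.execute(RUNNING_COUNTS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_counts(ROWS):
        print(row)
```

The WHERE clause is applied before the window functions are evaluated, so rows with a score of 80 or below never enter either count.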
The total cost for each day:

SELECT orderdate, SUM(cost) AS daily_total
FROM business
GROUP BY orderdate;
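The business queries so far (April 2017 customers with their count, and daily totals) can be tried on a small in-memory table. This is a sketch under stated assumptions: sqlite3 replaces Hive, the rows are invented, the helper names load_business, april_customers and daily_totals are hypothetical, and SQLite's substr is used where Hive would use SUBSTRING.

```python
import sqlite3

# Hypothetical sample rows for business(name, orderdate, cost).
ROWS = [
    ("jack", "2017-01-01", 10),
    ("tony", "2017-01-01", 15),
    ("jack", "2017-04-06", 42),
    ("mart", "2017-04-08", 62),
    ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12),
]


def load_business(rows):
    """Create an in-memory business table (SQLite stands in for Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
    conn.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)
    return conn


def april_customers(rows):
    """Customers with an order in 2017-04, each paired with the customer count."""
    conn = load_business(rows)
    result = conn.execute(
        "SELECT name, COUNT(*) OVER () AS month_users "
        "FROM business "
        "WHERE substr(orderdate, 1, 7) = '2017-04' "
        "GROUP BY name "  # one row per customer before the window is applied
        "ORDER BY name"
    ).fetchall()
    conn.close()
    return result


def daily_totals(rows):
    """Total cost per order date."""
    conn = load_business(rows)
    result = conn.execute(
        "SELECT orderdate, SUM(cost) AS daily_total "
        "FROM business GROUP BY orderdate ORDER BY orderdate"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(april_customers(ROWS))
    print(daily_totals(ROWS))
```

The GROUP BY name matters in the April query: the window function runs after grouping, so COUNT(*) OVER () counts customers. Without the GROUP BY it would count April orders instead.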
A running total of each customer's cost, accumulated in date order:

SELECT name, orderdate, cost,
       SUM(cost) OVER (PARTITION BY name ORDER BY orderdate
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_amount
FROM business;

Each customer's previous purchase date, using lag() with '1970-01-01' as the default when there is no earlier order:

SELECT name, orderdate, cost,
       LAG(orderdate, 1, '1970-01-01') OVER (PARTITION BY name ORDER BY orderdate) AS last_purchase_date
FROM business;

Row numbers: use row_number(). The query below also shows rank() and dense_rank() for comparison:
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC) AS row_num,
       RANK()       OVER (PARTITION BY subject ORDER BY score DESC) AS rank_num,
       DENSE_RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS dense_rank_num
FROM score;
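The difference between the three ranking functions only shows up when scores tie, so a small example helps. The sketch below is an illustration rather than the article's own data: sqlite3 stands in for Hive, the rows and the name ranking_comparison are hypothetical, and the score table is assumed to have the columns name, subject and score. ROW_NUMBER() gets name as an extra tie-breaker so its output is deterministic.

```python
import sqlite3

# Hypothetical sample rows for score(name, subject, score).
ROWS = [
    ("alice", "math", 95),
    ("bob", "math", 95),  # ties with alice
    ("carol", "math", 80),
    ("dave", "math", 70),
    ("alice", "english", 60),
    ("bob", "english", 75),
]

RANKING_SQL = """
SELECT name, subject, score,
       -- always 1, 2, 3, ... (name breaks ties so the order is fixed)
       ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC, name) AS row_num,
       -- ties share a rank and the next rank is skipped
       RANK()       OVER (PARTITION BY subject ORDER BY score DESC) AS rank_num,
       -- ties share a rank and no rank is skipped
       DENSE_RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS dense_rank_num
FROM score
ORDER BY subject, row_num
"""


def ranking_comparison(rows):
    """Return (name, subject, score, row_num, rank_num, dense_rank_num) rows."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("CREATE TABLE score (name TEXT, subject TEXT, score INTEGER)")
    conn.executemany("INSERT INTO score VALUES (?, ?, ?)", rows)
    result = conn.execute(RANKING_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in ranking_comparison(ROWS):
        print(row)
```

For the two tied math scores, row_number() yields 1 and 2, while rank() and dense_rank() both yield 1 and 1. The next score then gets 3 from rank() and 2 from dense_rank().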
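To keep only the top three scores in each subject, compute the row number in a CTE and filter on it in the outer query:

WITH ranked_data AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC) AS rmp
    FROM score
)
SELECT *
FROM ranked_data
WHERE rmp <= 3;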
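The top-N-per-group pattern above can be sketched end to end as follows. The assumptions are the same as before: sqlite3 is used in place of Hive, the rows and the function name top_n_per_subject are hypothetical, and name is added to the ORDER BY as a tie-breaker so that tied scores come out in a fixed order.

```python
import sqlite3

# Hypothetical sample rows for score(name, subject, score).
ROWS = [
    ("alice", "math", 95),
    ("bob", "math", 95),
    ("carol", "math", 80),
    ("dave", "math", 70),
    ("alice", "english", 60),
    ("bob", "english", 75),
]

TOP_N_SQL = """
WITH ranked_data AS (
    SELECT name, subject, score,
           ROW_NUMBER() OVER (PARTITION BY subject
                              ORDER BY score DESC, name) AS rmp
    FROM score
)
SELECT name, subject, score, rmp
FROM ranked_data
WHERE rmp <= ?          -- filter on the window result in the outer query
ORDER BY subject, rmp
"""


def top_n_per_subject(rows, n):
    """Return the n highest-scoring rows of each subject."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("CREATE TABLE score (name TEXT, subject TEXT, score INTEGER)")
    conn.executemany("INSERT INTO score VALUES (?, ?, ?)", rows)
    result = conn.execute(TOP_N_SQL, (n,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_n_per_subject(ROWS, 3):
        print(row)
```

The CTE (or an equivalent subquery) is needed because a window function cannot be referenced in the WHERE clause of the same query level that computes it. The filter has to be applied one level up.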
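Working through these examples, from basic aggregation to window functions, gives a step-by-step path to writing more complex queries in Hive. Beyond the syntax itself, the exercises are a foundation for solving real problems with it.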
Reposted from: http://xnpaz.baihongyu.com/