Применение HIVE на примере анализа лога

Для примера беру лог squid:

21/Dec/2016:03:13:08 +0300] 502 89.111.178.135 TCP_MISS/200/200 64880 POST http://venera-plus.ru/xmlrpc.php venera-plus.ru 91.218.228.248 text/xml 

[21/Dec/2016:03:13:08 +0300] 98 193.232.145.36 TCP_MISS/301/301 203 GET http://dvmg.ru/check.htm?id=1b3cb64ad33846f19387b761ac056a7b dvmg.ru 91.195.240.135 -

[21/Dec/2016:03:13:08 +0300] 215 89.111.185.190 TCP_MISS/200/200 582 GET http://atcdbnzyby.tnx.net/users/at/atcdbnzyby/cat-alog.ru/5/2fJnc9NTA=.txt atcdbnzyby.tnx.net 208.91.197.27 text/html

[21/Dec/2016:03:13:08 +0300] 201 89.111.177.70 TCP_MISS/200/200 578 GET http://ipages.tnx.net/users/ip/ipages/denttorg.ru/e/71/Dg0.txt ipages.tnx.net 208.91.197.27 text/html

[21/Dec/2016:03:13:08 +0300] 0 89.111.176.66 TCP_MEM_HIT/302/- 576 GET http://fprcorp.com/minilink.php?md=2&nl=ivarga.ru//index.php/index.php?option=com_coppermine*Itemid=88888888&host=ivarga.ru - - text/html

[21/Dec/2016:03:13:08 +0300] 644 80.93.62.199 TCP_MISS/200/200 38187 GET http://btuan03.newproduct10.online/weilai.php?mt=0&yid=19&lid=33398&from=http%3A%2F%2Fwww.zavod-kmz.ru%2Fincludes%2Farise%2F33398%2Fimages%2F%2Bbuttons&jump=0&action=&cache=1 btuan03.newproduct10.online 104.27.169.223 text/html

Копируем его в HDFS:

$ hdfs dfs -put /tmp/access.log /user/hive/

Переключаемся в консоль Hive (не забываем, что подключаться надо через ZooKeeper!):

$ beeline -u "jdbc:hive2://master1.hadoop.stage.int.nic.ru:2181,master2.hadoop.stage.int.nic.ru:2181/; serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive

Connecting to jdbc:hive2://master1.hadoop.stage.int.nic.ru:2181,master2.hadoop.stage.int.nic.ru:2181/; serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

17/01/10 12:18:02 [main]: INFO jdbc.HiveConnection: Connected to master1.hadoop.stage.int.nic.ru:10000

Connected to: Apache Hive (version 1.1.0-cdh5.9.0)

Driver: Hive JDBC (version 1.1.0-cdh5.9.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline version 1.1.0-cdh5.9.0 by Apache Hive

0: jdbc:hive2://master1.hadoop.stage.int.nic.>

Создаем нашу таблицу:

0: jdbc:hive2://master1.hadoop.stage.int.nic.> CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;

INFO : Compiling command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76): CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE

INFO : Semantic Analysis Completed

INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)

INFO : Completed compiling command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76); Time taken: 0.852 seconds

INFO : Executing command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76): CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE

INFO : Starting task [Stage-0:DDL] in serial mode

INFO : Completed executing command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76); Time taken: 0.235 seconds

INFO : OK

No rows affected (1.447 seconds)

 

0: jdbc:hive2://master1.hadoop.stage.int.nic.> show tables;

INFO : Compiling command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a): show tables

INFO : Semantic Analysis Completed

INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)

INFO : Completed compiling command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a); Time taken: 0.09 seconds

INFO : Executing command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a): show tables

INFO : Starting task [Stage-0:DDL] in serial mode

INFO : Completed executing command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a); Time taken: 0.029 seconds

INFO : OK

+---------------+--+

| tab_name |

+---------------+--+

| fz139 |

+---------------+--+

 

Загружаем данные в таблицу:

0: jdbc:hive2://master1.hadoop.stage.int.nic.> LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139;

INFO : Compiling command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c): LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139

INFO : Semantic Analysis Completed

INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)

INFO : Completed compiling command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c); Time taken: 0.03 seconds

INFO : Executing command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c): LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139

INFO : Starting task [Stage-0:MOVE] in serial mode

INFO : Loading data to table default.fz139 from hdfs://nameservice1/user/hive/access.log

INFO : Starting task [Stage-1:STATS] in serial mode

INFO : Table default.fz139 stats: [numFiles=1, numRows=0, totalSize=171828911, rawDataSize=0]

INFO : Completed executing command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c); Time taken: 0.465 seconds

INFO : OK

No rows affected (0.538 seconds)

Посчитаем сколько каких кодов возврата с удаленных ip было получено:

0: jdbc:hive2://master1.hadoop.stage.int.nic.> select tcp_code,count(remote_ip) from fz139 group by tcp_code;

INFO : Compiling command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641): select tcp_code,count(remote_ip) from fz139 group by tcp_code

INFO : Semantic Analysis Completed

INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tcp_code, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)

INFO : Completed compiling command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641); Time taken: 0.211 seconds

INFO : Executing command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641): select tcp_code,count(remote_ip) from fz139 group by tcp_code

INFO : Query ID = hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641

INFO : Total jobs = 1

INFO : Launching Job 1 out of 1

INFO : Starting task [Stage-1:MAPRED] in serial mode

INFO : Number of reduce tasks not specified. Estimated from input data size: 3

INFO : In order to change the average load for a reducer (in bytes):

INFO : set hive.exec.reducers.bytes.per.reducer=<number>

INFO : In order to limit the maximum number of reducers:

INFO : set hive.exec.reducers.max=<number>

INFO : In order to set a constant number of reducers:

INFO : set mapreduce.job.reduces=<number>

INFO : number of splits:2

INFO : Submitting tokens for job: job_1484035485729_0003

INFO : The url to track the job: http://master2.hadoop.stage.int.nic.ru:8088/proxy/application_1484035485729_0003/

INFO : Starting Job = job_1484035485729_0003, Tracking URL = http://master2.hadoop.stage.int.nic.ru:8088/proxy/application_1484035485729_0003/

INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1484035485729_0003

INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3

INFO : 2017-01-10 12:59:02,111 Stage-1 map = 0%, reduce = 0%

INFO : 2017-01-10 12:59:11,672 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.74 sec

INFO : 2017-01-10 12:59:20,150 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 9.07 sec

INFO : 2017-01-10 12:59:21,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.03 sec

INFO : MapReduce Total cumulative CPU time: 13 seconds 30 msec

INFO : Ended Job = job_1484035485729_0003

INFO : MapReduce Jobs Launched:

INFO : Stage-Stage-1: Map: 2 Reduce: 3 Cumulative CPU: 13.03 sec HDFS Read: 169784770 HDFS Write: 1376 SUCCESS

INFO : Total MapReduce CPU Time Spent: 13 seconds 30 msec

INFO : Completed executing command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641); Time taken: 33.968 seconds

INFO : OK

+----------------------------------+---------+--+

| tcp_code | _c1 |

+----------------------------------+---------+--+

| TCP_MEM_HIT/200/- | 4448 |

| TCP_MEM_HIT/302/- | 801 |

| TCP_MISS/200/200 | 522901 |

| TCP_MISS/206/206 | 6 |

| TCP_MISS/303/303 | 22 |

| TCP_MISS/402/402 | 287 |

| TCP_MISS/405/405 | 741 |

| TCP_MISS/408/408 | 3 |

| TCP_MISS/417/417 | 6 |

| TCP_MISS/429/429 | 98 |

| TCP_MISS/502/502 | 69 |

| TCP_MISS/504/- | 8 |

| TCP_MISS/520/520 | 79 |

| TCP_MISS/523/523 | 19 |

……..

61 rows selected (34.283 seconds)

unix-way