Для примера беру лог squid:
21/Dec/2016:03:13:08 +0300] 502 89.111.178.135 TCP_MISS/200/200 64880 POST http://venera-plus.ru/xmlrpc.php venera-plus.ru 91.218.228.248 text/xml [21/Dec/2016:03:13:08 +0300] 98 193.232.145.36 TCP_MISS/301/301 203 GET http://dvmg.ru/check.htm?id=1b3cb64ad33846f19387b761ac056a7b dvmg.ru 91.195.240.135 - [21/Dec/2016:03:13:08 +0300] 215 89.111.185.190 TCP_MISS/200/200 582 GET http://atcdbnzyby.tnx.net/users/at/atcdbnzyby/cat-alog.ru/5/2fJnc9NTA=.txt atcdbnzyby.tnx.net 208.91.197.27 text/html [21/Dec/2016:03:13:08 +0300] 201 89.111.177.70 TCP_MISS/200/200 578 GET http://ipages.tnx.net/users/ip/ipages/denttorg.ru/e/71/Dg0.txt ipages.tnx.net 208.91.197.27 text/html [21/Dec/2016:03:13:08 +0300] 0 89.111.176.66 TCP_MEM_HIT/302/- 576 GET http://fprcorp.com/minilink.php?md=2&nl=ivarga.ru//index.php/index.php?option=com_coppermine*Itemid=88888888&host=ivarga.ru - - text/html [21/Dec/2016:03:13:08 +0300] 644 80.93.62.199 TCP_MISS/200/200 38187 GET http://btuan03.newproduct10.online/weilai.php?mt=0&yid=19&lid=33398&from=http%3A%2F%2Fwww.zavod-kmz.ru%2Fincludes%2Farise%2F33398%2Fimages%2F%2Bbuttons&jump=0&action=&cache=1 btuan03.newproduct10.online 104.27.169.223 text/html |
Копируем его в HDFS:
$ hdfs dfs -put /tmp/access.log /user/hive/ |
Переключаемся в консоль Hive (не забываем, что подключаться надо через ZooKeeper!):
$ beeline -u "jdbc:hive2://master1.hadoop.stage.int.nic.ru:2181,master2.hadoop.stage.int.nic.ru:2181/; serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive Connecting to jdbc:hive2://master1.hadoop.stage.int.nic.ru:2181,master2.hadoop.stage.int.nic.ru:2181/; serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2 17/01/10 12:18:02 [main]: INFO jdbc.HiveConnection: Connected to master1.hadoop.stage.int.nic.ru:10000 Connected to: Apache Hive (version 1.1.0-cdh5.9.0) Driver: Hive JDBC (version 1.1.0-cdh5.9.0) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.1.0-cdh5.9.0 by Apache Hive 0: jdbc:hive2://master1.hadoop.stage.int.nic.> |
Создаем нашу таблицу:
0: jdbc:hive2://master1.hadoop.stage.int.nic.> CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE; INFO : Compiling command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76): CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null) INFO : Completed compiling command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76); Time taken: 0.852 seconds INFO : Executing command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76): CREATE TABLE IF NOT EXISTS fz139 (date STRING, tz STRING, timework INT, remote_ip STRING, TCP_Code STRING, size INT, method STRING, url STRING, other STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE INFO : Starting task [Stage-0:DDL] in serial mode INFO : Completed executing command(queryId=hive_20170110122727_935cc72b-ab33-409a-a0b9-2f0f68393e76); Time taken: 0.235 seconds INFO : OK No rows affected (1.447 seconds)
0: jdbc:hive2://master1.hadoop.stage.int.nic.> show tables; INFO : Compiling command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a): show tables INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null) INFO : Completed compiling command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a); Time taken: 0.09 seconds INFO : Executing command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a): show tables INFO : Starting task [Stage-0:DDL] in serial mode INFO : Completed executing command(queryId=hive_20170110122828_7f2259af-adff-43b1-ad5a-207d2db41d0a); Time taken: 0.029 seconds INFO : OK +---------------+--+ | tab_name | +---------------+--+ | fz139 | +---------------+--+ |
Загружаем данные в таблицу:
0: jdbc:hive2://master1.hadoop.stage.int.nic.> LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139; INFO : Compiling command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c): LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139 INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null) INFO : Completed compiling command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c); Time taken: 0.03 seconds INFO : Executing command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c): LOAD DATA INPATH '/user/hive/access.log' OVERWRITE INTO TABLE fz139 INFO : Starting task [Stage-0:MOVE] in serial mode INFO : Loading data to table default.fz139 from hdfs://nameservice1/user/hive/access.log INFO : Starting task [Stage-1:STATS] in serial mode INFO : Table default.fz139 stats: [numFiles=1, numRows=0, totalSize=171828911, rawDataSize=0] INFO : Completed executing command(queryId=hive_20170110124444_ba5c336a-18b3-48e4-8776-9153cc6b915c); Time taken: 0.465 seconds INFO : OK No rows affected (0.538 seconds) |
Посчитаем сколько каких кодов возврата с удаленных ip было получено:
0: jdbc:hive2://master1.hadoop.stage.int.nic.> select tcp_code,count(remote_ip) from fz139 group by tcp_code; INFO : Compiling command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641): select tcp_code,count(remote_ip) from fz139 group by tcp_code INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tcp_code, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641); Time taken: 0.211 seconds INFO : Executing command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641): select tcp_code,count(remote_ip) from fz139 group by tcp_code INFO : Query ID = hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Number of reduce tasks not specified. Estimated from input data size: 3 INFO : In order to change the average load for a reducer (in bytes): INFO : set hive.exec.reducers.bytes.per.reducer=<number> INFO : In order to limit the maximum number of reducers: INFO : set hive.exec.reducers.max=<number> INFO : In order to set a constant number of reducers: INFO : set mapreduce.job.reduces=<number> INFO : number of splits:2 INFO : Submitting tokens for job: job_1484035485729_0003 INFO : The url to track the job: http://master2.hadoop.stage.int.nic.ru:8088/proxy/application_1484035485729_0003/ INFO : Starting Job = job_1484035485729_0003, Tracking URL = http://master2.hadoop.stage.int.nic.ru:8088/proxy/application_1484035485729_0003/ INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1484035485729_0003 INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3 INFO : 2017-01-10 12:59:02,111 Stage-1 map = 0%, reduce = 0% INFO : 2017-01-10 12:59:11,672 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.74 sec INFO : 2017-01-10 12:59:20,150 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 9.07 sec INFO : 2017-01-10 12:59:21,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.03 sec INFO : MapReduce Total cumulative CPU time: 13 seconds 30 msec INFO : Ended Job = job_1484035485729_0003 INFO : MapReduce Jobs Launched: INFO : Stage-Stage-1: Map: 2 Reduce: 3 Cumulative CPU: 13.03 sec HDFS Read: 169784770 HDFS Write: 1376 SUCCESS INFO : Total MapReduce CPU Time Spent: 13 seconds 30 msec INFO : Completed executing command(queryId=hive_20170110125858_0910ba3e-e197-4e7d-a580-36106c85b641); Time taken: 33.968 seconds INFO : OK +----------------------------------+---------+--+ | tcp_code | _c1 | +----------------------------------+---------+--+ | TCP_MEM_HIT/200/- | 4448 | | TCP_MEM_HIT/302/- | 801 | | TCP_MISS/200/200 | 522901 | | TCP_MISS/206/206 | 6 | | TCP_MISS/303/303 | 22 | | TCP_MISS/402/402 | 287 | | TCP_MISS/405/405 | 741 | | TCP_MISS/408/408 | 3 | | TCP_MISS/417/417 | 6 | | TCP_MISS/429/429 | 98 | | TCP_MISS/502/502 | 69 | | TCP_MISS/504/- | 8 | | TCP_MISS/520/520 | 79 | | TCP_MISS/523/523 | 19 | …….. 61 rows selected (34.283 seconds) |