Using Snowplow will reduce your analytics costs. This is the first article, with detailed instructions on setting up the entire pipeline that transfers events from a mobile app to a Redshift database. The next article takes a closer look at building a dashboard to visualize the collected data.
Tristan Handy's article The Startup Founder's Guide to Analytics is an excellent introduction to this one; a Russian translation is available on Habr: https://habr.com/ru/post/346326/
The author recommends using Snowplow for analytics:
"Migrate from your existing analytics and event tracking systems to Snowplow Analytics. Snowplow does
everything the
paid tools do, but it is open source. You can host it yourself (and
just pay the cost of your EC2 instances) or pay Snowplow or
Fivetran to host the event collector. If you don't make the jump at this stage, you won't be able to collect nearly as detailed data, and you will be
setting yourself up for some truly enormous Segment, Heap, or Mixpanel bills in the near future. Once you reach this
stage, the paid tools can easily charge $10,000 per month."
, . Simo Ahava snowplow,
, snowplow
snowplow
, 2 .
:
- Linux / Unix ( Terminal Mac OS X).
- Git — , Snowplow.
- Amazon Web Services 12 .
- .
- (
snowplow.denjoy.ru
), DNS ( ). - Android Snowplow
Tracker
- .
?
, :

:
- , Clojure Collector.
- - AWS Elastic Beanstalk,
AWS Route 53.
- AWS S3.
- , ETL (extract,
transform, load), AWS EMR,
S3.
- AWS Redshift.
, .
0: AWS IAM-
- AWS
. ,
.
AMR
, , Amazon Web Services IAM (Identity and Access
Management) , .

(IAM)
, , :
IAM.
IAM Snowplow.
Services IAM .

- «Groups».
- «Create new group» .

-
snowplow-setup
«Next step». - «Attach Policy», «Next step».
- «Create Group».
«Policy».
- «Create Policy».

- JSON :
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:*",
        "autoscaling:*",
        "aws-marketplace:ViewSubscriptions",
        "aws-marketplace:Subscribe",
        "aws-marketplace:Unsubscribe",
        "cloudformation:*",
        "cloudfront:*",
        "cloudwatch:*",
        "ec2:*",
        "elasticbeanstalk:*",
        "elasticloadbalancing:*",
        "elasticmapreduce:*",
        "es:*",
        "iam:*",
        "rds:*",
        "redshift:*",
        "s3:*",
        "sns:*"
      ],
      "Resource": "*"
    }
  ]
}

- «Review policy».
-
snowplow-setup-policy-infrastructure
. - «Create Policy».
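Pasting a long policy by hand invites typos. As a minimal sketch (the helper name `build_setup_policy` is hypothetical and not part of any AWS tooling), the policy document from this step can be assembled and checked for valid JSON in Python before it goes into the console:

```python
import json

# Service-level actions copied from the policy shown in this step.
SETUP_ACTIONS = [
    "acm:*", "autoscaling:*",
    "aws-marketplace:ViewSubscriptions", "aws-marketplace:Subscribe",
    "aws-marketplace:Unsubscribe",
    "cloudformation:*", "cloudfront:*", "cloudwatch:*", "ec2:*",
    "elasticbeanstalk:*", "elasticloadbalancing:*", "elasticmapreduce:*",
    "es:*", "iam:*", "rds:*", "redshift:*", "s3:*", "sns:*",
]


def build_setup_policy(actions=SETUP_ACTIONS):
    """Return the IAM policy document as a plain dict (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


policy_json = json.dumps(build_setup_policy(), indent=2)
# Round-trip through the parser to confirm the text is valid JSON.
assert json.loads(policy_json)["Statement"][0]["Effect"] == "Allow"
print(policy_json)
```

The printed text can then be pasted into the JSON tab as is.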
«Groups» «snowplow-setup», .

- Permissions «Attach Policy».
-
snowplow-setup-policy-infrastructure
«Attach Policy».


«Users» «Add user».
-
snowplow-setup
. - «Programmatic access».
- «Next: Permissions».
- «Add user to group»,
snowplow-setup
, «Next: Tags» - «Next: Review»
- «Create user».

, , – . CSV, «Download .csv».

, , . , , .
, 0 !
- AWS.
- IAM-
snowplow-setup
.
1: Clojure collector
- DNS
Clojure Collector — , web-endpoint, . -, Apache Tomcat, AWS Elastic Beanstalk. Clojure Collector Tomcat AWS S3, , Clojure Collector, .
Clojure Collector
, , WAR Clojure Collector.
. clojure-collector-1.X.X-standalone.war
.
, Elastic Beanstalk.
AWS Services Elastic Beanstalk.

, AWS, Snowplow, . , . .

Elastic Beanstalk
- «Create Application».
- (,
Snowplow Clojure Collector
). - Platform Tomcat, Tomcat 8.5 with Java 8 running on 64bit Amazon Linux
- Application Code «Upload your code» WAR-.


- «Create application»
- ,


Clojure Collector , . , Applications cookie sp
. , .

! Clojure Collector.
.
S3
Tomcat S3 – . -, HTTP-, , .
S3, Elastic Beanstalk. Elastic Beanstalk AWS.
- .
- «Edit» «Software Configuration».

- «S3 log storage» «Rotate logs».

, , S3 ETL.
«Apply», .
, Elastic Beanstalk - auto-scalable, .
- «Configuration» .
- «Capacity» «Edit».
- «Environment Type» , «Load balanced», , .

, .
Elastic Beanstalk SSL
.
- Services AWS «Route 53» .
- «Create hosted zone».
- Domain Name , .
snowplow.denjoy.ru
. «Public Hosted Zone» «Create hosted zone».

- . NS. .

- , NS , cloudflare.
- 4 NS- . CloudFlare:

, NS- snowplow.denjoy.ru
, NS AWS. .
-, , https://dnschecker.org/.
, , Route 53, . , Route 53 Elastic Beanstalk. , URL- snowplow.denjoy.ru
, DNS AWS, - Clojure Collector. !
- , «Create Record».

- «Simple Routing»
- «Define simple record»
- In the window that opens, leave the «Record name» field blank; under «Value/Route traffic to» select «Alias to Elastic Beanstalk environment»; in the next field select the region; under «Record type» select the A record type; then click «Define simple record» at the bottom of the window
- After the window closes, click «Create records»
Now, if you open http://snowplow.denjoy.ru/i in a browser, you should see the same pixel as on the Clojure Collector page itself. Domain routing works!
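The browser check can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the collector answers the `/i` pixel request with HTTP 200:

```python
import urllib.request


def pixel_url(domain, https=False):
    """Build the URL of the collector's pixel endpoint (/i)."""
    scheme = "https" if https else "http"
    return f"{scheme}://{domain}/i"


def collector_is_up(url, timeout=10):
    """Return True if the pixel request is answered with HTTP 200."""
    # URLError, HTTPError and socket timeouts are all OSError subclasses,
    # so any network-level failure is reported as "not up".
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

Calling `collector_is_up(pixel_url("snowplow.denjoy.ru"))` performs the same request as the browser; passing `https=True` is useful once the certificate from the next section is in place.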
But we are not done yet.
Configuring HTTPS for Clojure Collector
() SSL- AWS Load Balancer. , Route 53, . SSL
- Services AWS Certificate Manager. «Provision certificates» «Get started»
- «Request a public certificate»
- , .
snowplow.denjoy.ru
«Next» - «DNS validation»
- Tags
- «Review» «Confirm and request»
- . , AWS , «Create record in Route 53»

- «Create»

Create . «Continue» . 30 , !
Load Balancer HTTPS
- Elastic Beanstalk, «Configuration». !
- «Load balancer» «Edit»
- «Listeners» «Add listener»
- Port 443, «Add».

- «Apply»
!
Snowplow Clojure Collector (, ).
, , .
— . Route 53, .
- Clojure Collector, Elastic Beanstalk.
- , Amazon Route 53.
- SSL- .
- Tomcat S3. S3 .
2:
Android Tracker . Tracker Demo, , , «Ok» .
, https://snowplow.denjoy.ru, HTTPS «Start». .


.
Clojure Collector, Elastic Beanstalk, Tomcat S3. , S3

S3 elasticbeanstalk-region-id
. resources/environments/logs/publish/(some ID)/(some ID)
. Some ID – , , e-ab12cd23ef
, , , i-1234567890
. gzip.
, _var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz
– , ETL .
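To pick the relevant files out of a bucket listing, the key layout described here can be parsed mechanically. This is a sketch only: the regular expression assumes the layout `resources/environments/logs/publish/<environment ID>/<instance ID>/<file>` and the ID shapes of the examples quoted in this section, and `parse_log_key` is a hypothetical helper name.

```python
import re

# Assumed key layout, derived from the example path and IDs in this section.
KEY_PATTERN = re.compile(
    r"^resources/environments/logs/publish/"
    r"(?P<environment>e-[a-z0-9]+)/(?P<instance>i-[a-z0-9]+)/"
    r"(?P<filename>[^/]+)$"
)


def parse_log_key(key):
    """Split an S3 key into environment ID, instance ID and file name.

    Returns None when the key does not follow the assumed layout.
    """
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    info = match.groupdict()
    # Flag the gzip-compressed, rotated Tomcat access logs.
    info["is_access_log"] = (
        "rotated_localhost_access_log" in info["filename"]
        and info["filename"].endswith(".gz")
    )
    return info


example = (
    "resources/environments/logs/publish/e-ab12cd23ef/i-1234567890/"
    "_var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz"
)
print(parse_log_key(example))
```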

, . HTTP- 200
. , , Clojure Collector . . :

, JSON .

3. ETL
- Clojure Collector.
- IAM, 0 .
.
, , AWS Elastic MapReduce (EMR).
- Tomcat.
- , IP-.
- , schema JSON.
- , , Amazon Redshift.
. , ETL S3-. , , . Tomcat , , .
Java- EmrEtlRunner . ETL Amazon Elastic MapReduce. , EmrEtlRunner . , , , 60 .
EmrEtlRunner
ETL — Unix, . , , snowplow_emr_rXX
, XX — . snowplow_emr_r117_biskupin.zip
.
- ZIP-
snowplow-emr-etl-runner
. . - Snowplow Github , SQL, .
- , ,
snowplow-emr-etl-runner
, :
git clone https://github.com/snowplow/snowplow.git

-
snowplow-emr-etl-runner
snowplow . -
config
targets
. - :
-
snowplow/3-enrich/emr-etl-runner/config/config.yml.sample
config/config.yml
. -
snowplow/3-enrich/config/iglu_resolver.json
config/iglu_resolver.json
. -
snowplow/4-storage/config/targets/redshift.json
config/targets/redshift.json
.
-

:
|-- snowplow-emr-etl-runner
|-- snowplow
|   |-- -SNOWPLOW GIT REPO HERE-
|-- config
|   |-- iglu_resolver.json
|   |-- config.yml
|   |-- targets
|   |   |-- redshift.json
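The empty folders of this layout can be created with a few lines of Python. This is a sketch under assumptions: `create_layout` is a hypothetical helper, the working directory name is made up, and the demonstration runs in a temporary directory so nothing real is touched. The runner binary, the cloned repository and the three config files still have to be copied in as described.

```python
from pathlib import Path
import tempfile


def create_layout(root):
    """Create the empty working directories for EmrEtlRunner under `root`."""
    root = Path(root)
    # parents=True means config/targets also creates config/.
    for relative in ("snowplow", "config/targets"):
        (root / relative).mkdir(parents=True, exist_ok=True)
    return root


# Demonstrate in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = create_layout(Path(tmp) / "etl-workdir")
    created = sorted(p.relative_to(base).as_posix() for p in base.rglob("*"))
    print(created)
```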
EC2
Amazon EC2. ETL Amazon, Amazon EC2. ETL , , .
- AWS Services EC2. «Key Pairs» .
- , , . .
- , , «Create key pair».

- .
denjoy-snowplow
. - pem
- , , <key pair name>.pem .

S3
Amazon S3. ETL.
:
:raw:in
— . -elasticbeanstalk
, Clojure Collector’, Elastic Beanstalk. :processing
— .:archive
— ::raw
( ),:enriched
( ):shredded
( ).:enriched
— ::good
( ),:bad
( , ).:shredded
— ::good
( , ),:bad
( , ).:log
— , ETL.
, S3, Services AWS S3.
:raw:in
, elasticbeanstalk-
.
, « » ETL.
«Create bucket» , denjoy-snowplow-data
. S3, snowplow. «Next» , , , «Create bucket».
, . :

«Create folder» :
archive
shredded
enriched

archive
:
raw
enriched
shredded
, enriched
, shredded
, :
good
bad
, , :
|-- elasticbeanstalk-region-id
|-- denjoy-snowplow-data
|   |-- archive
|   |   |-- raw
|   |   |-- enriched
|   |   |-- shredded
|   |-- enriched
|   |   |-- good
|   |   |-- bad
|   |-- shredded
|   |   |-- good
|   |   |-- bad
S3 denjoy-snowplow-log
. , ETL.
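The folder structure translates directly into the S3 URIs that the ETL configuration later refers to. A small sketch (the function name is hypothetical; the bucket name is the one used in this article) that lists them, which helps when checking the configuration for typos:

```python
def data_bucket_prefixes(bucket="denjoy-snowplow-data"):
    """List the S3 folder URIs that make up the data bucket layout."""
    layout = {
        "archive": ("raw", "enriched", "shredded"),
        "enriched": ("good", "bad"),
        "shredded": ("good", "bad"),
    }
    return [
        f"s3://{bucket}/{parent}/{child}"
        for parent, children in layout.items()
        for child in children
    ]


for uri in data_bucket_prefixes():
    print(uri)
```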
EmrEtlRunner
EmrEtlRunner. config.yml
, snowplow config/
. :
-
snowplow-setup
, 0. , AWS IAM.
- AWS. ,
Python/pip
, Mac OS X, Homebrew. , Homebrew,brew install awscli
AWS.
, awscli
, aws configure
. , , , , eu-west-1
.
$ aws configure
AWS Access Key ID: <enter your IAM user Access Key ID here>
AWS Secret Access Key: <enter your IAM user Secret Access Key here>
Default region name: <enter the region name, e.g. eu-west-1 here>
Default output format: <just press enter>
aws configure
aws emr create-default-roles
. - EmrEtlRunner, EC2.
EmrEtlRunner!
EmrEtlRunner
EmrEtlRunner — snowplow-emr-etl-runner
.
EmrEtlRunner . . . , 13, rdb_load. . .
EmrEtlRunner config.yml
, config
. , , , .
aws:
  access_key_id: AKIAIBAWU2NAYME55123
  secret_access_key: iEmruXM7dSbOemQy63FhRjzhSboisP5TcJlj9123
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: s3://simoahava-snowplow-log
      raw:
        in:
          - s3://elasticbeanstalk-eu-west-1-375284143851/resources/environments/logs/publish/e-f4pdn8dtsg
        processing: s3://simoahava-snowplow-data/processing
        archive: s3://simoahava-snowplow-data/archive/raw
      enriched:
        good: s3://simoahava-snowplow-data/enriched/good
        bad: s3://simoahava-snowplow-data/enriched/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/enriched
      shredded:
        good: s3://simoahava-snowplow-data/shredded/good
        bad: s3://simoahava-snowplow-data/shredded/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/shredded
  emr:
    ami_version: 5.9.0
    region: eu-west-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:
    ec2_subnet_id: subnet-d6e91a9e
    ec2_key_name: simoahava
    bootstrap: []
    software:
      hbase:
      lingual:
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:
        volume_size: 100
        volume_type: "gp2"
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: clj-tomcat
enrich:
  versions:
    spark_enrich: 1.12.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG
, , , . -. , , .
:aws:access_key_id
|
IAM. |
:aws:secret_access_key
|
IAM. |
:aws:s3:region
|
, S3. |
:aws:s3:buckets:log
|
S3, ETL. |
-:aws:s3:buckets:raw:in
|
, Tomcat. . ! , ! |
:aws:s3:buckets:raw:processing
|
. |
:aws:s3:buckets:raw:archive
|
. |
:aws:s3:buckets:enriched:good
|
. |
:aws:s3:buckets:enriched:bad
|
. |
:aws:s3:buckets:enriched:errors
|
. |
:aws:s3:buckets:enriched:archive
|
. |
:aws:s3:buckets:shredded:good
|
. |
:aws:s3:buckets:shredded:bad
|
. |
:aws:s3:buckets:shredded:errors
|
. |
:aws:s3:buckets:shredded:archive
|
|
:aws:emr:region
|
, EC2. |
:aws:emr:placement
|
. |
:aws:emr:ec2_subnet_id
|
VDS, . , EC2, . |
:aws:emr:ec2_key_name
|
EC2. |
:collectors:format
|
clj-tomcat. |
:monitoring:snowplow
|
(:method , :app_id :collector ). |
.
-, :aws:s3:buckets:raw:in
. . , . , .
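Bucket paths are easy to mistype, so a quick automated check before the first run can save a failed job. The sketch below is an illustration, not part of Snowplow: `find_bucket_problems` is a hypothetical helper that walks the `buckets` section, assumed to be already loaded into a Python dict, and reports any value that is not an `s3://` URI.

```python
def find_bucket_problems(buckets, prefix=":aws:s3:buckets"):
    """Walk the nested buckets mapping and report suspicious values."""
    problems = []
    for key, value in buckets.items():
        path = f"{prefix}:{key}"
        if isinstance(value, dict):
            problems.extend(find_bucket_problems(value, path))
            continue
        # raw:in is a list of locations; other keys hold a single string.
        for uri in (value if isinstance(value, list) else [value]):
            if uri is None:
                continue  # keys such as jsonpath_assets may be left empty
            if not str(uri).startswith("s3://"):
                problems.append(f"{path}: {uri!r} is not an s3:// URI")
    return problems


# Example input with a deliberately broken raw:in entry.
buckets = {
    "log": "s3://denjoy-snowplow-log",
    "raw": {
        "in": ["elasticbeanstalk-region-id/resources/environments/logs/publish"],
        "processing": "s3://denjoy-snowplow-data/processing",
    },
}
print(find_bucket_problems(buckets))
```

Here only the `raw:in` entry is reported, because it lacks the `s3://` scheme.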

:aws:emr:ec2_subnet_id
, Services AWS EC2. «Instances», . «subnet» aws:emr:ec2_subnet_id
.

, .
, , , snowplow-emr-etl-runner
.
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json

Invalid InstanceProfile: EMR_EC2_DefaultRole.
ETL S3. .
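If the run is to be scheduled or wrapped in a script, it helps to build the command as an argument list. A minimal sketch, assuming the resolver file is named `iglu_resolver.json` as in the directory tree and that the runner is started from its own folder; `etl_command` is a hypothetical helper:

```python
import shlex


def etl_command(config="config/config.yml",
                resolver="config/iglu_resolver.json",
                targets=None):
    """Build the argument list for a snowplow-emr-etl-runner run."""
    argv = ["./snowplow-emr-etl-runner", "run", "-c", config, "-r", resolver]
    if targets is not None:
        # -t points at the folder holding storage targets (redshift.json).
        argv += ["-t", targets]
    return argv


print(shlex.join(etl_command()))
print(shlex.join(etl_command(targets="config/targets")))
```

The list can be handed to `subprocess.run(argv, check=True)`, so a failed ETL run raises an exception instead of passing silently.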
ETL, AWS Redshift, !
snowplow-emr-etl-runner
.- S3-.
- ETL S3.
4: Redshift
- ETL .
- S3-.
- GUI SQL-. Table Plus, , . .
Redshift. Redshift — , AWS. , , Tomcat. SQL . , SQL, Codecademy, SQL!
:
- Redshift.
- .
- EmrEtlRunner Redshift.
, , EmrEtlRunner, . SQL- ( ) Snowplow: .
AWS Amazon Redshift.
, ( , ). «Launch Cluster».

. snowplow-cluster
. . snowplow
.
Node type dc2.large
, Cluster type Single Node 1 .
- (5439).
-. , , . - — .
-.
, «Create cluster».

.
. Redshift.

, , , .
«Clusters» , .
«Properties» «Network and security» VPC security groups ( sg-c3f5c687
).

EC2.
.
«Inbound rules» , TCP- 5439
0.0.0.0/0
. , TCP- ( ).
, .

. Amazon Redshift . .

SQL. Table Plus. «Create new connection» :
- : Amazon Redshift (
com.amazon.redshift.jdbc.Driver
) - Host:
endpoint
- User:
awsuser
- Password:
master_password
- Database:
snowplow
-, .
:

«Connect», .
SELECT current_database();
«Run current», , . :

– !
-, , Android Tracker. .sql , DDL, .
.sql , Snowplow:
- snowplow/4-storage/redshift-storage/sql/atomic-def.sql
- snowplow/4-storage/redshift-storage/sql/manifest-def.sql
atomic-def.sql
Table Plus. atomic
atomic.events
.

manifest-def.sql
. .
DDL . , ETL , .
.sql :
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.mobile
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.snowplow
, SQL- , :
SELECT * FROM pg_tables WHERE schemaname='atomic';

:
storageloader
, ETL.power_user
, , -.read_only
, .
SQL-. ($password
) , + .
CREATE USER storageloader PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO storageloader;
GRANT INSERT ON ALL TABLES IN SCHEMA atomic TO storageloader;

CREATE USER read_only PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA atomic TO read_only;
CREATE SCHEMA scratchpad;
GRANT ALL ON SCHEMA scratchpad TO read_only;

CREATE USER power_user PASSWORD '$password';
GRANT ALL ON DATABASE snowplow TO power_user;
GRANT ALL ON SCHEMA atomic TO power_user;
GRANT ALL ON ALL TABLES IN SCHEMA atomic TO power_user;
, 12 .

, , atomic
storageloader
, .
, :
SELECT 'ALTER TABLE atomic.' || tablename ||' OWNER TO storageloader;' FROM pg_tables WHERE schemaname='atomic' AND NOT tableowner='storageloader';
:
ALTER TABLE atomic.* OWNER TO storageloader;
.
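The generated statements follow a fixed pattern, so they can also be produced outside the database. A short sketch, where `owner_statements` is a hypothetical helper and the table list is made up for illustration (in practice the names come from the `pg_tables` query above):

```python
def owner_statements(tables, owner="storageloader", schema="atomic"):
    """Return one ALTER TABLE ... OWNER TO statement per table name."""
    return [
        f"ALTER TABLE {schema}.{table} OWNER TO {owner};"
        for table in tables
    ]


# Hypothetical table list for illustration only.
for statement in owner_statements(["events", "manifest"]):
    print(statement)
```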

,
SELECT * FROM pg_tables WHERE schemaname='atomic' AND tableowner='storageloader';
.
, EmrEtlRunner ETL, storageloader
- S3 Redshift.
IAM-
EmrEtlRunner Redshift RDB Loader ( ). , IAM-, Redshift S3-.
- , AWS Services IAM.
- Roles. «Create role».
- «Select type of trusted entity» AWS - Redshift . «Select your use case» «Redshift - Customizable» «Next: permissions».

- AmazonS3ReadOnlyAccess . «Next: Tags».

- «Next: review»
- , ,
RedshiftS3Access
«Create role». - . RedshiftS3Access , . Role ARN. .

- Amazon Redshift .
- Snowplow « IAM».

- «Available IAM roles» , «Add IAM role» «Done», .

Redshift
, 3, config/
targets/
redshift.json
.
redshift.json
, :
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "ADD HERE",
    "database": "ADD HERE",
    "port": 5439,
    "sslMode": "DISABLE",
    "username": "ADD HERE",
    "password": "ADD HERE",
    "roleArn": "ADD HERE",
    "schema": "atomic",
    "maxError": 1,
    "compRows": 20000,
    "sshTunnel": null,
    "purpose": "ENRICHED_EVENTS"
  }
}
, :
host
: URL- Redshiftdatabase
:username
:storageloader
password
:storageloader
roleArn
: ARN IAM-, .
-.
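Filling in the placeholders can be scripted so that no "ADD HERE" value is left behind by accident. The sketch below is illustrative only: `fill_target` is a hypothetical helper, the template is abbreviated to the fields that need editing, and the host, password and account ID are placeholders rather than real values.

```python
import json


def fill_target(template, host, database, username, password, role_arn):
    """Return a copy of the redshift.json template with the blanks filled in."""
    target = json.loads(json.dumps(template))  # cheap deep copy
    target["data"].update(
        host=host, database=database, username=username,
        password=password, roleArn=role_arn,
    )
    missing = [k for k, v in target["data"].items() if v == "ADD HERE"]
    if missing:
        raise ValueError(f"still unset: {missing}")
    return target


# Abbreviated template: only the fields relevant to this step.
template = {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {"host": "ADD HERE", "database": "ADD HERE", "port": 5439,
             "username": "ADD HERE", "password": "ADD HERE",
             "roleArn": "ADD HERE", "schema": "atomic"},
}
# All values below are placeholders, not real endpoints or credentials.
filled = fill_target(template, "example-endpoint", "snowplow",
                     "storageloader", "example-password",
                     "arn:aws:iam::123456789012:role/RedshiftS3Access")
print(json.dumps(filled, indent=2))
```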
EmrEtlRunner
, , EmrEtlRunner,
Redshift.
, ( snowplow-emr-etl-runner
):
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json -t config/targets
:raw:in
(, Tomcat)
, , Redshift. ,
.
- :

read_only .

, , , , (
), ,
, Snowplow.
- Amazon, , DNS
AWS.
- Clojure Collector — , HTTP- Tomcat
S3-.
- ETL, ,
S3.
- , ETL , ,
AWS Redshift.
, , , - –
, -.
, , , .
Discourse
Snowplow — , , .
!