Using Snowplow will reduce your analytics costs. This is the first article, with detailed instructions for setting up the entire pipeline that transfers events from a mobile application to a Redshift database. The next article takes a closer look at building a dashboard to visualize the collected data.
Tristan Handy's article The Startup Founder's Guide to Analytics is an excellent introduction to this one; a translation is available on Habr: https://habr.com/ru/post/346326/
The author recommends using Snowplow for analytics:
“Migrate from your existing analytics and event tracking systems to Snowplow Analytics. Snowplow does everything the paid tools do, but it is open source. You can host it yourself (and pay only for your EC2 instances) or pay to have the event collector hosted by Snowplow or Fivetran. If you do not make the leap at this stage, you will lose the ability to collect much more detailed data and will be setting yourself up for some truly enormous Segment, Heap, or Mixpanel bills in the near future. Past this stage, the paid tools can easily cost $10,000 a month.”
, . Simo Ahava snowplow,
, snowplow
snowplow
, 2 .
:
- Linux / Unix ( Terminal Mac OS X).
- Git — , Snowplow.
- Amazon Web Services 12 .
- .
- (
snowplow.denjoy.ru), DNS ( ). - Android Snowplow
Tracker
- .
?
, :
:
- , Clojure Collector.
- - AWS Elastic Beanstalk,
AWS Route 53.
- AWS S3.
- , ETL (extract,
transform, load), AWS EMR,
S3.
- AWS Redshift.
, .
0: AWS IAM-
- AWS
. ,
.
AMR
, , Amazon Web Services IAM (Identity and Access
Management) , .
(IAM)
, , :
IAM.
IAM Snowplow.
Services IAM .
- «Groups».
- «Create new group» .
- snowplow-setup «Next step».
- «Attach Policy», «Next step».
- «Create Group».
«Policy».
- «Create Policy».
- JSON :
{
"Version": "2012-10-17",
"Statement": [
{ "Effect":
"Allow",
"Action": [
"acm:*",
"autoscaling:*",
"aws-marketplace:ViewSubscriptions",
"aws-marketplace:Subscribe",
"aws-marketplace:Unsubscribe",
"cloudformation:*",
"cloudfront:*",
"cloudwatch:*",
"ec2:*",
"elasticbeanstalk:*",
"elasticloadbalancing:*",
"elasticmapreduce:*",
"es:*",
"iam:*",
"rds:*",
"redshift:*",
"s3:*",
"sns:*"
],
"Resource": "*"
}
]
}
- «Review policy».
- snowplow-setup-policy-infrastructure.
- «Create Policy».
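The policy document above can be sanity-checked locally before it is pasted into the console. The following is a minimal sketch, not part of any AWS tooling: `granted_services` is a hypothetical helper that parses the JSON and lists the service prefixes the policy allows, which makes a mistyped or missing action easy to spot.

```python
import json

# The policy text from this article, reproduced as a string.
POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:*", "autoscaling:*",
        "aws-marketplace:ViewSubscriptions",
        "aws-marketplace:Subscribe",
        "aws-marketplace:Unsubscribe",
        "cloudformation:*", "cloudfront:*", "cloudwatch:*",
        "ec2:*", "elasticbeanstalk:*", "elasticloadbalancing:*",
        "elasticmapreduce:*", "es:*", "iam:*", "rds:*",
        "redshift:*", "s3:*", "sns:*"
      ],
      "Resource": "*"
    }
  ]
}
"""


def granted_services(policy_text):
    """Return the sorted service prefixes that the policy allows actions on."""
    document = json.loads(policy_text)  # raises ValueError on malformed JSON
    services = set()
    for statement in document["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):  # a single action may be a plain string
            actions = [actions]
        for action in actions:
            services.add(action.split(":", 1)[0])
    return sorted(services)
```

Calling `granted_services(POLICY)` should list, among others, `s3`, `redshift`, `elasticmapreduce` and `elasticbeanstalk`, the services used in the later steps.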
«Groups» «snowplow-setup», .
- Permissions «Attach Policy».
- snowplow-setup-policy-infrastructure «Attach Policy».
«Users» «Add user».
- snowplow-setup.
- «Programmatic access».
- «Next: Permissions».
- «Add user to group», snowplow-setup, «Next: Tags»
- «Next: Review»
- «Create user».
, , – . CSV, «Download .csv».
, , . , , .
, 0 !
- AWS.
- IAM-
snowplow-setup.
1: Clojure collector
- DNS
Clojure Collector — , web-endpoint, . -, Apache Tomcat, AWS Elastic Beanstalk. Clojure Collector Tomcat AWS S3, , Clojure Collector, .
Clojure Collector
, , WAR Clojure Collector.
. clojure-collector-1.X.X-standalone.war.
, Elastic Beanstalk.
AWS Services Elastic Beanstalk.
, AWS, Snowplow, . , . .
Elastic Beanstalk
- «Create Application».
- (, Snowplow Clojure Collector).
- Platform Tomcat, Tomcat 8.5 with Java 8 running on 64bit Amazon Linux
- Application Code «Upload your code» WAR-.
- «Create application»
- ,
Clojure Collector , . , Applications cookie sp. , .
! Clojure Collector.
.
S3
Tomcat S3 – . -, HTTP-, , .
S3, Elastic Beanstalk. Elastic Beanstalk AWS.
- .
- «Edit» «Software Configuration».
- «S3 log storage» «Rotate logs».
, , S3 ETL.
«Apply», .
, Elastic Beanstalk - auto-scalable, .
- «Configuration» .
- «Capacity» «Edit».
- «Environment Type» , «Load balanced», , .
, .
Elastic Beanstalk SSL
.
- Services AWS «Route 53» .
- «Create hosted zone».
- Domain Name , .
snowplow.denjoy.ru. «Public Hosted Zone» «Create hosted zone».
- . NS. .
- , NS , cloudflare.
- 4 NS- . CloudFlare:
, NS- snowplow.denjoy.ru, NS AWS. .
-, , https://dnschecker.org/.
, , Route 53, . , Route 53 Elastic Beanstalk. , URL- snowplow.denjoy.ru , DNS AWS, - Clojure Collector. !
- , «Create Record».
- «Simple Routing»
- «Define simple record»
- In the window that opens, leave the Record name field empty; in the Value/Route traffic to field select «Alias to Elastic Beanstalk environment»; in the next field select the region; in the Record type field select an «A» record; then click the «Define simple record» button at the bottom of the window
<img src="denjoy.storage.yandexcloud.net/snowplow1/image7.png" alt="image7">
- After the window closes, click the «Create records» button
Now, if you open http://snowplow.denjoy.ru/i in a browser, you should see the same pixel as when you opened the Clojure Collector page directly. So domain routing works!
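The browser check can also be scripted. This is a minimal sketch using only the Python standard library; `pixel_url` and `collector_responds` are hypothetical helper names, and the host and the `/i` path are the ones used in this article.

```python
from urllib.error import URLError
from urllib.request import urlopen


def pixel_url(host, https=False):
    """Build the collector pixel URL for a host such as snowplow.denjoy.ru."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}/i"


def collector_responds(host, https=False, timeout=5.0):
    """Return True if the pixel endpoint answers with HTTP 200."""
    try:
        with urlopen(pixel_url(host, https), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # DNS not propagated yet, connection refused, timeout, HTTP error, ...
        return False
```

Running `collector_responds("snowplow.denjoy.ru")` from a machine with network access gives a quick yes/no answer; passing `https=True` repeats the check once the certificate is in place.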
But we are not done yet.
Setting up HTTPS for the Clojure Collector
() SSL- AWS Load Balancer. , Route 53, . SSL
- Services AWS Certificate Manager. «Provision certificates» «Get started»
- «Request a public certificate»
- , . snowplow.denjoy.ru «Next»
- «DNS validation»
- Tags
- «Review» «Confirm and request»
- . , AWS , «Create record in Route 53»
- «Create»
Create . «Continue» . 30 , !
Load Balancer HTTPS
- Elastic Beanstalk, «Configuration». !
- «Load balancer» «Edit»
- «Listeners» «Add listener»
- Port 443, «Add».
- «Apply»
!
Snowplow Clojure Collector (, ).
, , .
— . Route 53, .
- Clojure Collector, Elastic Beanstalk.
- , Amazon Route 53.
- SSL- .
- Tomcat S3. S3 .
2:
Android Tracker . Tracker Demo, , , «Ok» .
, https://snowplow.denjoy.ru, HTTPS «Start». .
.
Clojure Collector, Elastic Beanstalk, Tomcat S3. , S3
S3 elasticbeanstalk-region-id. resources/environments/logs/publish/(some ID)/(some ID). Some ID – , , e-ab12cd23ef, , , i-1234567890. gzip.
, _var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz – , ETL .
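To tell the rotated access logs apart from the other files in the bucket, the key layout described above can be expressed as a pattern. This is a sketch only: the regular expression is inferred from the example names in this article (an `e-…` environment ID, an `i-…` instance ID and the `_var_log_tomcat8_rotated_localhost_access_log.txt…gz` file name), and `parse_log_key` is a hypothetical helper.

```python
import re

# Pattern assumed from the example key names shown in this article.
LOG_KEY = re.compile(
    r"^resources/environments/logs/publish/"
    r"(?P<environment>e-[a-z0-9]+)/"
    r"(?P<instance>i-[a-z0-9]+)/"
    r"_var_log_tomcat8_rotated_localhost_access_log\.txt\d+\.gz$"
)


def parse_log_key(key):
    """Return (environment_id, instance_id) for a collector log key, else None."""
    match = LOG_KEY.match(key)
    if match is None:
        return None
    return match.group("environment"), match.group("instance")
```

Keys for which `parse_log_key` returns `None` are other Tomcat or Beanstalk logs and are not the ones the ETL process is meant to read.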
, . HTTP- 200. , , Clojure Collector . . :
, JSON .
3. ETL
- Clojure Collector.
- IAM, 0 .
.
, , AWS Elastic MapReduce (EMR).
- Tomcat.
- , IP-.
- , schema JSON.
- , , Amazon Redshift.
. , ETL S3-. , , . Tomcat , , .
Java- EmrEtlRunner . ETL Amazon Elastic MapReduce. , EmrEtlRunner . , , , 60 .
EmrEtlRunner
ETL — Unix, . , , snowplow_emr_rXX, XX — . snowplow_emr_r117_biskupin.zip.
- ZIP-
snowplow-emr-etl-runner. . - Snowplow Github , SQL, .
- , ,
snowplow-emr-etl-runner, :
git clone https://github.com/snowplow/snowplow.git
- The snowplow-emr-etl-runner folder now contains a snowplow folder with the repository.
- Create a config folder with a targets folder inside it.
- Copy the following files:
  - snowplow/3-enrich/emr-etl-runner/config/config.yml.sample to config/config.yml.
  - snowplow/3-enrich/config/iglu_resolver.json to config/iglu_resolver.json.
  - snowplow/4-storage/config/targets/redshift.json to config/targets/redshift.json.
:
|-- snowplow-emr-etl-runner
|-- snowplow
|   |-- -SNOWPLOW GIT REPO HERE-
|-- config
|   |-- iglu_resolver.json
|   |-- config.yml
|   |-- targets
|   |   |-- redshift.json
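The layout above can also be produced with a short script instead of copying files by hand. This is a sketch under the assumption that the repository has already been cloned into a `snowplow` folder inside the working folder; `prepare_config` is a hypothetical helper, and the paths are the ones listed in this article.

```python
import shutil
from pathlib import Path

# (source inside the cloned repository, destination inside config/)
COPIES = [
    ("snowplow/3-enrich/emr-etl-runner/config/config.yml.sample", "config/config.yml"),
    ("snowplow/3-enrich/config/iglu_resolver.json", "config/iglu_resolver.json"),
    ("snowplow/4-storage/config/targets/redshift.json", "config/targets/redshift.json"),
]


def prepare_config(root):
    """Create config/targets under root and copy the sample files into place.

    Returns the destinations that were actually copied; a missing source
    file is skipped so the caller can see what is absent.
    """
    root = Path(root)
    (root / "config" / "targets").mkdir(parents=True, exist_ok=True)
    copied = []
    for source, destination in COPIES:
        source_path = root / source
        if source_path.is_file():
            shutil.copyfile(source_path, root / destination)
            copied.append(destination)
    return copied
```

If `prepare_config(".")` returns fewer than three entries, the repository version in use probably keeps one of the sample files at a different path.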
EC2
Amazon EC2. ETL Amazon, Amazon EC2. ETL , , .
- AWS Services EC2. «Key Pairs» .
- , , . .
- , , «Create key pair».
- . denjoy-snowplow.
- pem
- , , <key pair name>.pem .
S3
Amazon S3. ETL.
:
- :raw:in - . -elasticbeanstalk, Clojure Collector, Elastic Beanstalk.
- :processing - .
- :archive - : :raw ( ), :enriched ( ), :shredded ( ).
- :enriched - : :good ( ), :bad ( , ).
- :shredded - : :good ( , ), :bad ( , ).
- :log - , ETL.
, S3, Services AWS S3.
:raw:in , elasticbeanstalk-.
, « » ETL.
«Create bucket» , denjoy-snowplow-data. S3, snowplow. «Next» , , , «Create bucket».
, . :
«Create folder» :
archive, shredded, enriched
archive :
raw, enriched, shredded
, enriched, shredded, :
good, bad
, , :
|-- elasticbeanstalk-region-id
|-- denjoy-snowplow-data
|   |-- archive
|   |   |-- raw
|   |   |-- enriched
|   |   |-- shredded
|   |-- enriched
|   |   |-- good
|   |   |-- bad
|   |-- shredded
|   |   |-- good
|   |   |-- bad
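For reference, the folder structure of the data bucket can be written down as data and expanded into S3 "folder" keys. This is a minimal sketch: `DATA_LAYOUT` and `folder_keys` are hypothetical names, and the trailing slash follows the convention the S3 console uses when it creates a folder.

```python
# Top-level folders of the data bucket and their subfolders, as in the tree above.
DATA_LAYOUT = {
    "archive": ["raw", "enriched", "shredded"],
    "enriched": ["good", "bad"],
    "shredded": ["good", "bad"],
}


def folder_keys(layout):
    """Return S3 folder keys (with a trailing slash) for a two-level layout."""
    keys = []
    for top, children in layout.items():
        keys.append(f"{top}/")
        keys.extend(f"{top}/{child}/" for child in children)
    return keys
```

The resulting list can be compared against the bucket contents, or, as one option, each key could be created as an empty object with an S3 client instead of clicking through «Create folder».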
S3 denjoy-snowplow-log. , ETL.
EmrEtlRunner
EmrEtlRunner. config.yml , snowplow config/. :
-
snowplow-setup, 0. , AWS IAM.
- AWS. ,
Python/pip, Mac OS X, Homebrew. , Homebrew,brew install awscliAWS.
, awscli, aws configure . , , , , eu-west-1.
$ aws configure
AWS Access Key ID: <enter your IAM user Access Key ID here>
AWS Secret Access Key: <enter your IAM user Secret Access Key here>
Default region name: <enter the region name, e.g. eu-west-1 here>
Default output format: <just press enter>
aws configure aws emr create-default-roles.
- EmrEtlRunner, EC2.
EmrEtlRunner!
EmrEtlRunner
EmrEtlRunner — snowplow-emr-etl-runner.
EmrEtlRunner . . . , 13, rdb_load. . .
EmrEtlRunner config.yml, config. , , , .
aws:
  access_key_id: AKIAIBAWU2NAYME55123
  secret_access_key: iEmruXM7dSbOemQy63FhRjzhSboisP5TcJlj9123
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: s3://simoahava-snowplow-log
      raw:
        in:
          - s3://elasticbeanstalk-eu-west-1-375284143851/resources/environments/logs/publish/e-f4pdn8dtsg
        processing: s3://simoahava-snowplow-data/processing
        archive: s3://simoahava-snowplow-data/archive/raw
      enriched:
        good: s3://simoahava-snowplow-data/enriched/good
        bad: s3://simoahava-snowplow-data/enriched/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/enriched
      shredded:
        good: s3://simoahava-snowplow-data/shredded/good
        bad: s3://simoahava-snowplow-data/shredded/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/shredded
  emr:
    ami_version: 5.9.0
    region: eu-west-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:
    ec2_subnet_id: subnet-d6e91a9e
    ec2_key_name: simoahava
    bootstrap: []
    software:
      hbase:
      lingual:
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:
        volume_size: 100
        volume_type: "gp2"
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: clj-tomcat
enrich:
  versions:
    spark_enrich: 1.12.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG
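Most mistakes in this file are mistyped bucket paths, so it can help to list every S3 location the configuration refers to. This is a minimal sketch that assumes the YAML has already been loaded into nested Python dictionaries and lists by a YAML parser (not shown, since none ships with the standard library); `s3_uris`, `bucket_of` and the `SAMPLE` excerpt are illustrative names only.

```python
def s3_uris(node, path=()):
    """Yield (dotted_path, uri) for every s3:// string in a nested config."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from s3_uris(value, path + (str(key),))
    elif isinstance(node, list):
        for item in node:
            yield from s3_uris(item, path)
    elif isinstance(node, str) and node.startswith("s3://"):
        yield ".".join(path), node


def bucket_of(uri):
    """Return the bucket name part of an s3:// URI."""
    return uri[len("s3://"):].split("/", 1)[0]


# Excerpt of the configuration above, as a parser would return it.
SAMPLE = {
    "aws": {"s3": {"buckets": {
        "log": "s3://simoahava-snowplow-log",
        "raw": {
            "in": ["s3://elasticbeanstalk-eu-west-1-375284143851/"
                   "resources/environments/logs/publish/e-f4pdn8dtsg"],
            "processing": "s3://simoahava-snowplow-data/processing",
        },
    }}}
}
```

Collecting `{bucket_of(uri) for _, uri in s3_uris(config)}` gives the set of buckets that must exist before the first run, which is easy to compare with the buckets created earlier.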
, , , . -. , , .
:aws:access_key_id
|
IAM. |
:aws:secret_access_key
|
IAM. |
:aws:s3:region
|
, S3. |
:aws:s3:buckets:log
|
S3, ETL. |
-:aws:s3:buckets:raw:in
|
, Tomcat. . ! , ! |
:aws:s3:buckets:raw:processing
|
. |
:aws:s3:buckets:raw:archive
|
. |
:aws:s3:buckets:enriched:good
|
. |
:aws:s3:buckets:enriched:bad
|
. |
:aws:s3:buckets:enriched:errors
|
. |
:aws:s3:buckets:enriched:archive
|
. |
:aws:s3:buckets:shredded:good
|
. |
:aws:s3:buckets:shredded:bad
|
. |
:aws:s3:buckets:shredded:errors
|
. |
:aws:s3:buckets:shredded:archive
|
|
:aws:emr:region
|
, EC2. |
:aws:emr:placement
|
. |
:aws:emr:ec2_subnet_id
|
VDS, . , EC2, . |
:aws:emr:ec2_key_name
|
EC2. |
:collectors:format
|
clj-tomcat. |
:monitoring:snowplow
|
(:method, :app_id :collector). |
.
-, :aws:s3:buckets:raw:in . . , . , .
:aws:emr:ec2_subnet_id , Services AWS EC2. «Instances», . «subnet» aws:emr:ec2_subnet_id.
, .
, , , snowplow-emr-etl-runner.
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json
Invalid InstanceProfile: EMR_EC2_DefaultRole.
ETL S3. .
ETL, AWS Redshift, !
snowplow-emr-etl-runner.- S3-.
- ETL S3.
4: Redshift
- ETL .
- S3-.
- GUI SQL-. Table Plus, , . .
Redshift. Redshift — , AWS. , , Tomcat. SQL . , SQL, Codecademy, SQL!
:
- Redshift.
- .
- EmrEtlRunner Redshift.
, , EmrEtlRunner, . SQL- ( ) Snowplow: .
AWS Amazon Redshift.
, ( , ). «Launch Cluster».
. snowplow-cluster. . snowplow.
Node type dc2.large, Cluster type Single Node 1 .
- (5439).
-. , , . - — .
-.
, «Create cluster».
.
. Redshift.
, , , .
«Clusters» , .
«Properties» «Network and security» VPC security groups ( sg-c3f5c687).
EC2.
.
«Inbound rules» , TCP- 5439 0.0.0.0/0 . , TCP- ( ).
, .
. Amazon Redshift . .
SQL. Table Plus. «Create new connection» :
- : Amazon Redshift (com.amazon.redshift.jdbc.Driver)
- Host: endpoint
- User: awsuser
- Password: master_password
- Database: snowplow
-, .
:
«Connect», .
SELECT current_database(); «Run current», , . :
– !
-, , Android Tracker. .sql , DDL, .
.sql , Snowplow:
- snowplow/4-storage/redshift-storage/sql/atomic-def.sql
- snowplow/4-storage/redshift-storage/sql/manifest-def.sql
atomic-def.sql Table Plus. atomic atomic.events.
manifest-def.sql. .
DDL . , ETL , .
.sql :
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.mobile
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.snowplow
, SQL- , :
SELECT * FROM pg_tables WHERE schemaname='atomic';
:
- storageloader, ETL.
- power_user, , -.
- read_only, .
SQL-. ($password) , + .
CREATE USER storageloader PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO storageloader;
GRANT INSERT ON ALL TABLES IN SCHEMA atomic TO storageloader;

CREATE USER read_only PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA atomic TO read_only;

CREATE SCHEMA scratchpad;
GRANT ALL ON SCHEMA scratchpad TO read_only;

CREATE USER power_user PASSWORD '$password';
GRANT ALL ON DATABASE snowplow TO power_user;
GRANT ALL ON SCHEMA atomic TO power_user;
GRANT ALL ON ALL TABLES IN SCHEMA atomic TO power_user;
, 12 .
, , atomic storageloader, .
, :
SELECT 'ALTER TABLE atomic.' || tablename ||' OWNER TO storageloader;' FROM pg_tables WHERE schemaname='atomic' AND NOT tableowner='storageloader';
:
ALTER TABLE atomic.* OWNER TO storageloader;
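The generating query above simply builds one ALTER TABLE statement per table. The same statements can be produced outside the database, for example to keep them in a script. This is a sketch only: `owner_statements` is a hypothetical helper, and the table names passed to it are examples.

```python
def owner_statements(tables, owner="storageloader", schema="atomic"):
    """Return one ALTER TABLE ... OWNER TO statement per table name."""
    return [f"ALTER TABLE {schema}.{table} OWNER TO {owner};" for table in tables]
```

For instance, `owner_statements(["events"])` yields the statement for `atomic.events`; the output is meant to be pasted into the SQL client and run there.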
.
,
SELECT * FROM pg_tables WHERE schemaname='atomic' AND tableowner='storageloader';
.
, EmrEtlRunner ETL, storageloader- S3 Redshift.
IAM-
EmrEtlRunner Redshift RDB Loader ( ). , IAM-, Redshift S3-.
- , AWS Services IAM.
- Roles. «Create role».
- «Select type of trusted entity» AWS - Redshift . «Select your use case» «Redshift - Customizable», «Next: Permissions».
- AmazonS3ReadOnlyAccess . «Next: Tags».
- «Next: review»
- , , RedshiftS3Access «Create role».
- . RedshiftS3Access , . Role ARN. .
- Amazon Redshift .
- Snowplow « IAM».
- «Available IAM roles» , «Add IAM role» «Done», .
Redshift
, 3, config/ targets/ redshift.json.
redshift.json , :
{
"schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
"data": {
"name": "AWS Redshift enriched events storage",
"host": "ADD HERE",
"database": "ADD HERE",
"port": 5439,
"sslMode": "DISABLE",
"username": "ADD HERE",
"password": "ADD HERE",
"roleArn": "ADD HERE",
"schema": "atomic",
"maxError": 1,
"compRows": 20000,
"sshTunnel": null,
"purpose": "ENRICHED_EVENTS"
}
}
, :
- host: URL- Redshift
- database:
- username: storageloader
- password: storageloader
- roleArn: ARN IAM-, .
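A forgotten placeholder in redshift.json only shows up when the load step fails, so it is worth checking the file before the run. This is a minimal sketch: `unfilled_fields` is a hypothetical helper that reports which fields under `data` still hold the `ADD HERE` placeholder from the template above.

```python
import json


def unfilled_fields(target_json):
    """Return the names of 'data' fields still set to the 'ADD HERE' placeholder."""
    data = json.loads(target_json)["data"]
    return sorted(name for name, value in data.items() if value == "ADD HERE")


# Shortened copy of the template from this article.
TEMPLATE = """
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "ADD HERE",
    "database": "ADD HERE",
    "port": 5439,
    "username": "ADD HERE",
    "password": "ADD HERE",
    "roleArn": "ADD HERE",
    "schema": "atomic"
  }
}
"""
```

On the untouched template the helper returns all five fields; an empty list means every placeholder has been replaced.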
-.
EmrEtlRunner
, , EmrEtlRunner,
Redshift.
, ( snowplow-emr-etl-runner
):
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json -t config/targets
:raw:in (, Tomcat)
, , Redshift. ,
.
- :
read_only .
, , , , (
), ,
, Snowplow.
- Amazon, , DNS
AWS.
- Clojure Collector — , HTTP- Tomcat
S3-.
- ETL, ,
S3.
- , ETL , ,
AWS Redshift.
, , , - –
, -.
, , , .
Discourse
Snowplow — , , .
!