Hello, Habr. We have prepared a translation of this material for future students of the "Hadoop, Spark, Hive Ecosystem" course.
We also invite everyone to the webinar "Testing Spark Applications". In this open lesson we will look at the problems of testing Spark applications: statistical data, partial verification, and starting and stopping heavy systems. We will study libraries that address these problems and write tests.
This article focuses exclusively on the Join operation in Apache Spark and gives an overview of the foundation on which Spark's Join technology is built.
Joins are often used in typical data mining workflows to correlate two datasets. Apache Spark, being a unified analytics engine, also provides a solid foundation for executing a wide variety of Join scenarios.
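At a high level, a Join operates on two input datasets, matching each record of one against the records of the other. On a match or a non-match (depending on the given condition), the Join outputs either an individual record from one of the two datasets or a Joined record. A Joined record combines the matched records from both datasets.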
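Important aspects of a Join:
Three aspects affect how a Join executes in Apache Spark:
1) Input dataset size: the size of the input datasets directly affects the efficiency and reliability of the Join. The relative sizes of the inputs also influence which Join mechanism is selected, which in turn affects the efficiency and reliability of the Join.
2) Join condition: the condition, or clause, on which the input datasets are joined is called the Join Condition. It typically involves logical comparison(s) between attributes of the input datasets. Based on the condition, Joins fall into two broad categories: Equi Joins and Non-Equi Joins.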
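An Equi Join involves one equality condition, or several equality conditions that must all hold at once. Each equality condition compares attributes of the two input datasets. For example, (A.x == B.x) and ((A.x == B.x) AND (A.y == B.y)) are both Equi Join conditions on the attributes x, y of the two input datasets A and B participating in the Join.
A Non-Equi Join does not rely on equality conditions alone. It may, however, contain several equality conditions that need not hold at the same time. For example, (A.x < B.x) and ((A.x == B.x) OR (A.y == B.y)) are both Non-Equi Join conditions on the attributes x, y of the two input datasets A and B participating in the Join.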
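3) Join type: the Join type determines the result of the Join once the Join condition has been applied to the records of the input datasets. The Join types fall into these broad classes:
Inner Join: an Inner Join outputs only the Joined records that match (on the Join condition) across the input datasets.
Outer Join: an Outer Join outputs the unmatched records in addition to the matched Joined records. Depending on which input dataset(s) contribute the unmatched records, it is further classified as a left, right, or full Outer Join.
Semi Join: a Semi Join outputs individual records from only one of the two input datasets, either on a match or on a non-match. When the record is output on a non-match, the Semi Join is also called an Anti Join.
Cross Join: a Cross Join outputs every Joined record that can be formed by pairing each record of one input dataset with each record of the other.
From an execution perspective, the Join type also affects which Join mechanism Apache Spark selects.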
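The Join conditions and Join types above can be illustrated with a small sketch in plain Python. This is not Spark code: the datasets `A` and `B`, the predicate names and the `join` helper are all hypothetical, and the naive nested loop is only a reference for what each Join type returns.

```python
from itertools import product

# Hypothetical toy datasets: each record is a dict with attributes x and y.
A = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
B = [{"x": 2, "y": 20}, {"x": 3, "y": 99}, {"x": 4, "y": 40}]

# Equi Join conditions: equality checks that must all hold at once.
def equi_one(a, b):
    return a["x"] == b["x"]

def equi_two(a, b):
    return a["x"] == b["x"] and a["y"] == b["y"]

# Non-Equi Join conditions: an inequality, or equalities combined with OR.
def non_equi_lt(a, b):
    return a["x"] < b["x"]

def non_equi_or(a, b):
    return a["x"] == b["x"] or a["y"] == b["y"]

def join(left, right, cond, how="inner"):
    """Naive nested-loop reference for the Join types (illustration only)."""
    if how == "cross":
        # Every left record paired with every right record; cond is ignored.
        return list(product(left, right))
    if how == "left_semi":
        # Individual left records that have at least one match.
        return [l for l in left if any(cond(l, r) for r in right)]
    if how == "left_anti":
        # Individual left records that have no match.
        return [l for l in left if not any(cond(l, r) for r in right)]
    # Inner part: matched Joined records.
    out = [(l, r) for l in left for r in right if cond(l, r)]
    if how in ("left_outer", "full_outer"):
        out += [(l, None) for l in left if not any(cond(l, r) for r in right)]
    if how in ("right_outer", "full_outer"):
        out += [(None, r) for r in right if not any(cond(l, r) for l in left)]
    return out
```

For instance, `join(A, B, equi_one, "left_anti")` keeps only the record of `A` whose `x` has no counterpart in `B`, while `join(A, B, non_equi_lt)` pairs each record of `A` with every record of `B` that has a larger `x`.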
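Join execution mechanisms
Having covered the aspects that affect a Join, let us look at the mechanisms that execute it.
Apache Spark provides five mechanisms for executing a Join:
Shuffle Hash Join
Broadcast Hash Join
Sort Merge Join
Cartesian Join
Broadcast Nested Loop Join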
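Broadcast Hash Join: in the 'Broadcast Hash Join' mechanism, one of the two input datasets (participating in the Join) is broadcast to all executors. Each executor builds a hash table from the broadcast dataset, and every partition of the non-broadcast dataset is then joined independently against that local hash table.
'Broadcast Hash Join' involves no sort step, which is one reason it is the fastest Join mechanism. It is only suitable, however, when one of the input datasets is small enough to broadcast comfortably. Spark also needs enough memory on each executor to hold the broadcast dataset.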
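The build-and-probe idea behind Broadcast Hash Join can be sketched in plain Python. This is a minimal illustration rather than Spark's implementation: partitions are modelled as lists of dicts, and the function names are hypothetical.

```python
from collections import defaultdict

def build_hash_table(rows, key):
    """Index the small (broadcast) dataset by its join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def probe_partition(partition, table, key):
    """Join one partition of the large dataset against the local hash table."""
    return [(row, match)
            for row in partition
            for match in table.get(row[key], [])]

def broadcast_hash_join(large_partitions, small_rows, key):
    # In Spark every executor would receive its own copy of the table;
    # here one shared table stands in for all of those copies.
    table = build_hash_table(small_rows, key)
    # Each partition is joined independently; no shuffle, no sort.
    return [probe_partition(p, table, key) for p in large_partitions]
```

Note that the large dataset keeps its original partitioning: only the small side moves, which is why no shuffle of the large side is needed.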
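Shuffle Hash Join: in the 'Shuffle Hash Join' mechanism, the two input datasets are first aligned to a chosen output partitioning scheme (for more on partitioning schemes, see "Guide to Spark Partitioning"). If either or both datasets do not conform to the chosen scheme, a shuffle is performed before the actual Join to bring them into conformity.
Once the inputs conform to the output partitioning, Shuffle Hash Join performs a standard Hash Join for each output partition. In a standard Hash Join, a hash table is built from the corresponding partition of the smaller input dataset, and the corresponding partition of the larger input dataset is then joined against that hash table.
Like 'Broadcast Hash Join', 'Shuffle Hash Join' involves no sort step. It is efficient when the hash table built for each partition fits comfortably in memory. When one of the input datasets is small enough to broadcast, however, 'Shuffle Hash Join' is the costlier Join. In that case Spark prefers 'Broadcast Hash Join'.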
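Sort Merge Join: the first stage of 'Sort Merge Join' is the same as in 'Shuffle Hash Join'. The two input datasets are aligned to a chosen output partitioning scheme. If either or both datasets do not conform to the chosen scheme, a shuffle is performed before the actual Join.
Once the inputs conform to the output partitioning, Sort Merge Join sorts each partition on the Join keys and then performs a standard Sort Merge Join on each pair of partitions.
'Sort Merge Join' is computationally less efficient than 'Shuffle Hash Join' and 'Broadcast Hash Join', but the memory requirements of the executors running 'Sort Merge Join' are considerably lower than for 'Shuffle Hash' and 'Broadcast Hash'. As with 'Shuffle Hash Join', if the input datasets do not conform to the desired output partitioning, the shuffle of the inputs adds to the cost of 'Sort Merge Join'.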
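Shuffle Hash Join and Sort Merge Join share the shuffle stage and differ only in the per-partition step, which the following plain-Python sketch tries to make visible. It is an illustration under simplifying assumptions, not Spark's implementation: the function names are hypothetical, partitioning is a plain hash modulo, and the right input is assumed to be the smaller (build) side.

```python
from collections import defaultdict

def shuffle(rows, key, num_partitions):
    """Route each row to an output partition by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def hash_join(build_rows, probe_rows, key):
    """Per-partition step of Shuffle Hash Join; returns (probe, build) pairs."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return [(p, b) for p in probe_rows for b in table.get(p[key], [])]

def merge_join(left_rows, right_rows, key):
    """Per-partition step of Sort Merge Join; returns (left, right) pairs."""
    left = sorted(left_rows, key=lambda row: row[key])
    right = sorted(right_rows, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair it with every left row sharing the same key.
            while i < len(left) and left[i][key] == lk:
                out += [(left[i], r) for r in right[j:j_end]]
                i += 1
            j = j_end
    return out

def partitioned_join(left, right, key, strategy, num_partitions=4):
    joined = []
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    for l_part, r_part in zip(left_parts, right_parts):
        if strategy == "shuffle_hash":
            # Assumption: the right input is the smaller one, so build on it.
            joined += hash_join(r_part, l_part, key)
        else:
            joined += merge_join(l_part, r_part, key)
    return joined
```

Both strategies return the same set of pairs. The hash variant keeps a whole build-side partition in a hash table, whereas the merge variant only walks two sorted sequences, which mirrors the memory trade-off described above.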
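Cartesian Join: Cartesian Join is used exclusively to perform a Cross Join between two input datasets. The number of output partitions always equals the product of the numbers of partitions of the two inputs. Each output partition maps to a unique pair of partitions, one from each input dataset. The data of each output partition is computed as the cartesian product of the data in the two input partitions mapped to it.
Cartesian Join can be very expensive, because the output grows with the product of the input sizes.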
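The partition arithmetic of Cartesian Join can be sketched in plain Python. This is an illustration only, with partitions modelled as lists and a hypothetical function name.

```python
from itertools import product

def cartesian_join(left_partitions, right_partitions):
    """One output partition per (left partition, right partition) pair."""
    output = []
    for l_part, r_part in product(left_partitions, right_partitions):
        # The output partition is the cartesian product of its two inputs.
        output.append(list(product(l_part, r_part)))
    return output
```

With 2 left partitions and 3 right partitions this yields 6 output partitions, and the total number of output records is the product of the two input record counts.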
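Broadcast Nested Loop Join: in the 'Broadcast Nested Loop Join' mechanism, one of the input datasets is broadcast to all executors. Each partition of the non-broadcast dataset is then joined to the broadcast dataset using a standard Nested Loop Join.
'Broadcast Nested Loop Join' is the slowest Join mechanism, since every record of a partition is compared with every record of the broadcast dataset. It is also memory-intensive, because the broadcast dataset has to be held on every executor.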
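A minimal plain-Python sketch of the nested loop follows. It is an illustration, not Spark's implementation; the function name and the `condition` callback are hypothetical. Because the condition is an arbitrary predicate, the same loop handles Non-Equi Joins.

```python
def broadcast_nested_loop_join(streamed_partitions, broadcast_rows, condition):
    """Compare every streamed row with every broadcast row, per partition."""
    return [
        [(s, b) for s in partition for b in broadcast_rows if condition(s, b)]
        for partition in streamed_partitions
    ]
```

The work per partition is proportional to the partition size times the size of the broadcast dataset, which is why this mechanism is the slowest of the five.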
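How does Spark select a Join mechanism?
With the aspects of a Join and the Join mechanisms covered, let us see how Spark chooses between them.
Spark selects a Join mechanism based on the following factors:
Configuration parameters
Join hints
Input dataset sizes
Join type
Join condition (Equi or Non-Equi Join)
The Spark API lets you attach an optional Join hint to one or both of the input datasets participating in a Join. The available hints, 'broadcast', 'merge', 'shuffle_hash' and 'shuffle_replicate_nl', suggest which mechanism to use for the Join.
With these factors in mind, Spark selects a Join mechanism as follows: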
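'Broadcast Hash Join'
Applies only to Equi Joins
Not applicable to 'Full Outer' Joins
In addition, one of the following must hold:
The 'Broadcast' hint is given on the left input dataset, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
No hint is given, but the left input dataset is small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold' (10 MB by default), and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
The 'Broadcast' hint is given on the right input dataset, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
No hint is given, but the right input dataset is small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold' (10 MB by default), and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
The 'Broadcast' hint is given on both input datasets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.
No hint is given, but both input datasets are small enough to broadcast according to 'spark.sql.autoBroadcastJoinThreshold' (10 MB by default), and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.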
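'Shuffle Hash Join'
Applies only to Equi Joins
Not applicable to 'Full Outer' Joins
The configuration 'spark.sql.join.preferSortMergeJoin' (true by default) is set to false
In addition, one of the following must hold:
The 'shuffle_hash' hint is given on the left input dataset, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
No hint is given, but the left input dataset is considerably smaller than the right one, and the Join type is 'Right Outer', 'Right Semi' or 'Inner'.
The 'shuffle_hash' hint is given on the right input dataset, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
No hint is given, but the right input dataset is considerably smaller than the left one, and the Join type is 'Left Outer', 'Left Semi' or 'Inner'.
The 'shuffle_hash' hint is given on both input datasets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.
No hint is given, but both input datasets are small enough, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' or 'Inner'.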
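'Sort Merge Join'
Applies only to Equi Joins
The Join Keys used in the Equi Join are sortable
The configuration 'spark.sql.join.preferSortMergeJoin' (true by default) is set to true
In addition, one of the following must hold:
The 'merge' hint is given on either input dataset; the Join type can be any.
No hint is given; the Join type can be any.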
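'Cartesian Join'
The Join type is 'Inner'
In addition, one of the following must hold:
The 'shuffle_replicate_nl' hint is given on either input dataset; the Join can be Equi or Non-Equi.
No hint is given; the Join can be Equi or Non-Equi.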
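'Broadcast Nested Loop Join'
'Broadcast Nested Loop Join' is the default Join mechanism; when no other mechanism can be selected, 'Broadcast Nested Loop Join' is chosen to execute a Join of any type and any Join condition.
When more than one mechanism is eligible, the order of preference is 'Broadcast Hash Join', 'Sort Merge Join', 'Shuffle Hash Join', 'Cartesian Join'.
Between Cartesian and Broadcast Nested Loop Join, Broadcast Nested Loop is preferred for Inner, Non-Equi Joins over Cartesian Join, provided one of the input datasets can be broadcast.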
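The selection rules above can be condensed into a deliberately simplified plain-Python model. It is a reading aid for this article, not Spark's planner code: the function and its parameters are hypothetical, the side-specific rules (left versus right input and the matching Join types) are collapsed into single flags, and `can_broadcast` and `can_build_hash` stand in for the size checks.

```python
PRIORITY = ["Broadcast Hash Join", "Sort Merge Join",
            "Shuffle Hash Join", "Cartesian Join"]

def select_mechanism(equi, join_type, hint=None, can_broadcast=False,
                     can_build_hash=False, keys_sortable=True,
                     prefer_sort_merge=True):
    """Simplified model of the rules in the text (not Spark's real planner)."""
    eligible = set()
    not_full_outer = join_type != "full_outer"

    if equi and not_full_outer and (
            hint == "broadcast" or (hint is None and can_broadcast)):
        eligible.add("Broadcast Hash Join")
    if equi and not_full_outer and not prefer_sort_merge and (
            hint == "shuffle_hash" or (hint is None and can_build_hash)):
        eligible.add("Shuffle Hash Join")
    if equi and keys_sortable and prefer_sort_merge and hint in ("merge", None):
        eligible.add("Sort Merge Join")
    if join_type == "inner" and hint in ("shuffle_replicate_nl", None):
        eligible.add("Cartesian Join")

    # Inner, Non-Equi Join without a hint: prefer Broadcast Nested Loop
    # over Cartesian when one input can be broadcast.
    if (eligible == {"Cartesian Join"} and not equi
            and hint is None and can_broadcast):
        return "Broadcast Nested Loop Join"

    for name in PRIORITY:
        if name in eligible:
            return name
    # Default when nothing else qualifies.
    return "Broadcast Nested Loop Join"
```

For example, `select_mechanism(equi=True, join_type="inner")` falls through to 'Sort Merge Join', while adding `can_broadcast=True` moves the choice to 'Broadcast Hash Join'. In PySpark a hint is normally attached with `DataFrame.hint(...)`, and the chosen mechanism can be checked in the physical plan printed by `DataFrame.explain()`.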
I hope this article has given you a good picture of how Joins work in Apache Spark.