Demystifying Joins in Apache Spark

Hello, Habr. We have prepared a translation of this article for future students of the "Hadoop, Spark, Hive Ecosystem" course.

We also invite everyone to the webinar "Testing Spark Applications". In this open lesson we will look at the problems of testing Spark applications: statistical data, partial verification, and starting and stopping heavyweight systems. We will study libraries that address these problems and write tests.






This article focuses exclusively on the Join operation in Apache Spark and gives an overview of the foundation on which Spark's Join support is built.

Joins are frequently used in typical data mining flows to correlate two data sets. Apache Spark, as a unified analytics engine, provides a solid foundation for executing a wide variety of Join scenarios.





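At a high level, a Join operates on two input data sets: each record of one data set is matched against every record of the other. On a match or a non-match (according to a given condition), the Join outputs either an individual record from one of the two data sets or a Joined record. A Joined record is the combination of the matched records from both data sets.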





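Important aspects of the Join operation:

Let us look at three important aspects that affect how a Join executes in Apache Spark. They are: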





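1) Size of the input data sets: The size of the input data sets directly affects the efficiency and reliability of the Join. In addition, the relative size of the two data sets affects the choice of Join mechanism, which in turn affects the efficiency and reliability of the Join.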





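2) The Join condition: The condition, or clause, on which the input data sets are joined is called the Join Condition. It typically consists of one or more logical comparisons between attributes of the input data sets. Based on the condition, Joins fall into two broad categories: Equi Join and Non-Equi Joins.

An Equi Join involves either a single equality condition or several equality conditions that must all hold simultaneously. Each equality condition applies to attributes from the two input data sets. For example, (A.x == B.x) is an Equi Join condition, as is ((A.x == B.x) and (A.y == B.y)), on the attributes x, y of the two input data sets A and B participating in the Join.

A Non-Equi Join does not consist of equality conditions alone. It may, however, involve several equality conditions that do not have to hold simultaneously. For example, (A.x < B.x) is a Non-Equi Join condition, as is ((A.x == B.x) or (A.y == B.y)), on the attributes x, y of the two input data sets A and B participating in the Join.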
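The difference between the two categories can be made concrete with a small plain-Python sketch. It is only an illustration, not Spark code: the data sets `A` and `B`, the helper `join`, and the predicate names are all hypothetical.

```python
# Hypothetical sample data: two data sets A and B whose records have attributes x and y.
A = [{"x": 1, "y": 10}, {"x": 2, "y": 20}]
B = [{"x": 1, "y": 10}, {"x": 3, "y": 20}]


def join(left, right, condition):
    """Return every (a, b) pair for which condition(a, b) is true."""
    return [(a, b) for a in left for b in right if condition(a, b)]


# Equi Join conditions: equalities only, all of which must hold together.
def equi_single(a, b):
    return a["x"] == b["x"]


def equi_multi(a, b):
    return a["x"] == b["x"] and a["y"] == b["y"]


# Non-Equi Join conditions: an inequality, or equalities combined with "or".
def non_equi_less(a, b):
    return a["x"] < b["x"]


def non_equi_or(a, b):
    return a["x"] == b["x"] or a["y"] == b["y"]


for name, cond in [("equi_single", equi_single), ("equi_multi", equi_multi),
                   ("non_equi_less", non_equi_less), ("non_equi_or", non_equi_or)]:
    print(name, len(join(A, B, cond)))
```

An Equi Join condition yields join keys that can be hashed or sorted, which is why the distinction matters for the choice of Join mechanism.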





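3) The Join type: The Join type determines the output of the Join once the Join condition has been applied to the records of the input data sets. The Join types fall into the following broad categories:

Inner Join: An Inner Join outputs only the matched Joined records (those satisfying the Join condition) from the input data sets.

Outer Join: An Outer Join outputs the non-matched records in addition to the matched Joined records. Outer Joins are further classified as left, right, or full, depending on which input data set(s) contribute the non-matched records.

Semi Join: A Semi Join outputs individual records from only one of the two input data sets, either on a match or on a non-match. When records are output on a non-match, the Semi Join is also called an Anti Join.

Cross Join: A Cross Join outputs every Joined record that can be formed by combining each record of one input data set with every record of the other.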
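The Join types above can be compared side by side in a minimal plain-Python sketch. The sample records and function names are hypothetical, and only the left-sided variants of the Outer, Semi, and Anti Joins are shown.

```python
from itertools import product

# Hypothetical sample data: (key, value) records.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]


def inner_join(lhs, rhs):
    """Joined records for every pair of records with equal keys."""
    return [(lk, lv, rv) for lk, lv in lhs for rk, rv in rhs if lk == rk]


def left_outer_join(lhs, rhs):
    """Matched Joined records plus unmatched left records padded with None."""
    out = []
    for lk, lv in lhs:
        matches = [rv for rk, rv in rhs if rk == lk]
        if matches:
            out.extend((lk, lv, rv) for rv in matches)
        else:
            out.append((lk, lv, None))
    return out


def left_semi_join(lhs, rhs):
    """Left records that have at least one match; right values are not output."""
    right_keys = {rk for rk, _ in rhs}
    return [(lk, lv) for lk, lv in lhs if lk in right_keys]


def left_anti_join(lhs, rhs):
    """Left records that have no match at all."""
    right_keys = {rk for rk, _ in rhs}
    return [(lk, lv) for lk, lv in lhs if lk not in right_keys]


def cross_join(lhs, rhs):
    """Every left record combined with every right record."""
    return list(product(lhs, rhs))


print(inner_join(left, right))
print(left_outer_join(left, right))
print(left_semi_join(left, right))
print(left_anti_join(left, right))
print(len(cross_join(left, right)))
```

Note that the Semi and Anti Joins return each left record at most once, even when a key such as 3 matches several right records.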





Based on these three aspects of a Join, Apache Spark chooses a suitable mechanism for executing it.





Join execution mechanisms

Apache Spark provides five mechanisms for executing a Join. They are:

  • Shuffle Hash Join
  • Broadcast Hash Join
  • Sort Merge Join
  • Cartesian Join
  • Broadcast Nested Loop Join





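Broadcast Hash Join: In the 'Broadcast Hash Join' mechanism, one of the two input data sets (participating in the Join) is broadcast to all executors. A hash table is built on each executor from the broadcast data set, and each partition of the non-broadcast data set is then joined independently against that local hash table.

'Broadcast Hash Join' does not involve a shuffle stage and is the most efficient mechanism. Its one requirement is that the executors have enough memory to hold the broadcast data set. Spark therefore avoids this mechanism when both input data sets exceed a configurable size threshold.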
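The idea behind 'Broadcast Hash Join' can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Spark's implementation: partitions are modelled as lists, the "broadcast" is simply a hash table shared by every partition, and all names are hypothetical.

```python
from collections import defaultdict


def build_hash_table(records, key):
    """Group the records of the small data set by join key."""
    table = defaultdict(list)
    for rec in records:
        table[rec[key]].append(rec)
    return table


def broadcast_hash_join(large_partitions, small_dataset, key):
    """Inner-join each partition of the large side against one shared hash table."""
    # Stands in for the copy of the hash table that every executor would hold.
    table = build_hash_table(small_dataset, key)
    output = []
    for partition in large_partitions:
        joined = []
        for rec in partition:
            # Probe the local hash table; no data moves between partitions.
            for match in table.get(rec[key], []):
                joined.append({**rec, **match})
        output.append(joined)
    return output


# Hypothetical sample data: a partitioned "orders" data set and a small "customers" one.
orders = [
    [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 5}],
    [{"cust": 1, "amt": 7}, {"cust": 9, "amt": 1}],
]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

print(broadcast_hash_join(orders, customers, "cust"))
```

The output keeps the partitioning of the large side, which reflects why no shuffle is needed: only the small data set is copied.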





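Shuffle Hash Join: In the 'Shuffle Hash Join' mechanism, both input data sets are first aligned to a chosen output partitioning scheme (for how that scheme is chosen, see "Guide to Spark Partitioning"). If one or both input data sets do not conform to the chosen scheme, a shuffle is executed before the actual Join to bring them into conformity.

Once both data sets conform to the output partitioning, 'Shuffle Hash Join' performs a classic single-node Hash Join on each output partition. In a classic Hash Join, a hash table is built from one of the two input partitions, and the records of the other partition are then looked up in that hash table.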
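The two steps of 'Shuffle Hash Join', repartitioning by key and then a per-partition Hash Join, can be sketched in plain Python. This is a simplified single-process model with hypothetical names; it assumes hash partitioning on the join key and a fixed build side.

```python
def shuffle(records, key, num_partitions):
    """Route each record to a partition chosen by the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec[key]) % num_partitions].append(rec)
    return partitions


def hash_join(probe_side, build_side, key):
    """Classic single-node Hash Join: build a table from one side, probe with the other."""
    table = {}
    for rec in build_side:
        table.setdefault(rec[key], []).append(rec)
    return [{**p, **b} for p in probe_side for b in table.get(p[key], [])]


def shuffle_hash_join(probe, build, key, num_partitions=2):
    """Align both sides to the same partitioning, then join partition by partition."""
    probe_parts = shuffle(probe, key, num_partitions)
    build_parts = shuffle(build, key, num_partitions)
    result = []
    # Records with equal keys always land in partitions with the same index.
    for p_part, b_part in zip(probe_parts, build_parts):
        result.extend(hash_join(p_part, b_part, key))
    return result


# Hypothetical sample data.
left = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}, {"id": 3, "l": "c"}]
right = [{"id": 2, "r": "x"}, {"id": 3, "r": "y"}, {"id": 5, "r": "z"}]

print(sorted(shuffle_hash_join(left, right, "id"), key=lambda rec: rec["id"]))
```

Because each hash table covers only one partition of the build side, no executor has to hold the whole data set, unlike in a broadcast.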
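
'Shuffle Hash Join' is more expensive than 'Broadcast Hash Join' because it can require a shuffle. It also needs enough executor memory to hold the hash table built for each partition. As with 'Broadcast Hash Join', Spark avoids this mechanism when the data set to be hashed is too large.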





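Sort Merge Join: The first step of 'Sort Merge Join' is the same as in 'Shuffle Hash Join'. Here too, both input data sets are aligned to a chosen output partitioning scheme. If one or both do not conform to the scheme, a shuffle is executed before the actual Join.

Once both data sets conform to the output partitioning, 'Sort Merge Join' sorts each partition of both data sets on the Join key and then performs a classic single-node Sort Merge Join on each output partition.

'Sort Merge Join' is computationally less efficient than 'Shuffle Hash Join' and 'Broadcast Hash Join', but the executor memory that 'Sort Merge Join' requires is considerably lower than for 'Shuffle Hash' and 'Broadcast Hash'. Also, as with 'Shuffle Hash Join', if the input data sets already conform to the desired output partitioning, the shuffle is skipped, which makes 'Sort Merge Join' cheaper.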
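The classic single-node Sort Merge Join that runs on each output partition can be sketched as follows. This is a plain-Python illustration of an inner equi-join on one partition pair, with hypothetical names and sample data; it is not Spark's implementation.

```python
def sort_merge_join(left, right, key):
    """Inner equi-join of two record lists by sorting both and merging them."""
    left = sorted(left, key=lambda rec: rec[key])
    right = sorted(right, key=lambda rec: rec[key])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1  # left key too small, advance left
        elif lk > rk:
            j += 1  # right key too small, advance right
        else:
            # Find the run of right records that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left record with this key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for rec in right[j:j_end]:
                    out.append({**left[i], **rec})
                i += 1
            j = j_end
    return out


# Hypothetical sample data with duplicate keys on both sides.
left = [{"k": 3, "l": "c"}, {"k": 1, "l": "a"}, {"k": 2, "l": "b1"}, {"k": 2, "l": "b2"}]
right = [{"k": 2, "r": "x"}, {"k": 4, "r": "w"}, {"k": 2, "r": "y"}, {"k": 3, "r": "z"}]

print([(rec["l"], rec["r"]) for rec in sort_merge_join(left, right, "k")])
```

The merge only ever holds the current run of equal keys, which illustrates why the memory requirement is lower than for a hash table over a whole partition.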





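Cartesian Join: The Cartesian Join mechanism is used exclusively to perform a cross join between two input data sets. The number of output partitions always equals the product of the numbers of partitions of the two input data sets. Each output partition is mapped to a unique pair of partitions, one from each input data set. For each output partition, the cartesian product of the records in that pair of partitions is computed.

The drawback of Cartesian Join is that it can cause the number of output partitions to explode. If you need a cross Join, however, Cartesian is the only mechanism.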
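The way the number of output partitions grows in a Cartesian Join can be sketched in plain Python. Partitions are modelled as lists and the names are hypothetical; this is an illustration, not Spark code.

```python
from itertools import product


def cartesian_join(left_partitions, right_partitions):
    """One output partition per (left partition, right partition) pair."""
    output_partitions = []
    for l_part, r_part in product(left_partitions, right_partitions):
        # Cartesian product of the records inside this pair of partitions.
        output_partitions.append(list(product(l_part, r_part)))
    return output_partitions


# Hypothetical sample data: 2 left partitions and 3 right partitions.
left_partitions = [[1, 2], [3]]
right_partitions = [["a"], ["b", "c"], ["d"]]

result = cartesian_join(left_partitions, right_partitions)
print(len(result))                        # number of output partitions
print(sum(len(part) for part in result))  # total number of Joined records
```

With 2 and 3 input partitions the sketch yields 6 output partitions, so two inputs with thousands of partitions each would produce millions.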





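Broadcast Nested Loop Join: In the 'Broadcast Nested Loop Join' mechanism, one of the input data sets is broadcast to all executors. Each partition of the non-broadcast data set is then joined with the broadcast data set using the standard Nested Loop Join procedure.

'Broadcast Nested Loop Join' is the slowest of the mechanisms, because the Join is executed with a nested loop. It is also memory-intensive, since one data set is broadcast in full to every executor.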
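A nested loop can evaluate any Join condition, which the following plain-Python sketch illustrates with a Non-Equi condition. It covers the inner Join case only, and the function name, parameters, and sample data are hypothetical.

```python
def broadcast_nested_loop_join(streamed_partitions, broadcast_dataset, condition):
    """Compare every streamed record with every broadcast record."""
    output = []
    for partition in streamed_partitions:
        joined = []
        for s in partition:              # outer loop: records of this partition
            for b in broadcast_dataset:  # inner loop: the full broadcast data set
                if condition(s, b):
                    joined.append((s, b))
        output.append(joined)
    return output


# Hypothetical sample data and a Non-Equi condition (s < b).
streamed = [[1, 5], [3]]
broadcast = [2, 4]

print(broadcast_nested_loop_join(streamed, broadcast, lambda s, b: s < b))
```

Every partition performs len(partition) × len(broadcast_dataset) comparisons, which is the cost the text describes.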





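How does Spark select a Join mechanism?

With the aspects of a Join and the mechanisms for executing it covered, let us look at how Spark chooses a mechanism.

Spark selects a Join mechanism based on:

  • Configuration parameters
  • Join hints
  • Size of the input data sets
  • Join type
  • Equi or Non-Equi Join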





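The Spark API lets you pass Join hints to influence which Join mechanism is selected. The supported hints are 'broadcast', 'merge', 'shuffle_hash', and 'shuffle_replicate_nl', and a hint is attached to an input data set of the Join.

Spark selects each Join mechanism as follows: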





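'Broadcast Hash Join'

Mandatory conditions:

  • Applies only to an Equi Join condition
  • Does not apply to the 'Full Outer' Join type

In addition, one of the following conditions must hold:

  • A 'Broadcast' hint is given on the left input data set, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.
  • No hint is given, but the left input data set is broadcastable according to 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.
  • A 'Broadcast' hint is given on the right input data set, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.
  • No hint is given, but the right input data set is broadcastable according to 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.
  • A 'Broadcast' hint is given on both input data sets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.
  • No hint is given, but both input data sets are broadcastable according to 'spark.sql.autoBroadcastJoinThreshold' (default 10 MB), and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.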
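The conditions listed for 'Broadcast Hash Join' can be read as a small decision function. The sketch below is a plain-Python paraphrase of that list, not Spark's planner code: the function name, the parameters, and the join-type strings are hypothetical, and picking the smaller side when both qualify is an assumption made only for this illustration.

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD = 10 * MB  # mirrors the 10 MB default quoted in the text

# Join types for which each side may be the broadcast side, per the list above.
BROADCAST_LEFT_TYPES = {"inner", "right_outer", "right_semi"}
BROADCAST_RIGHT_TYPES = {"inner", "left_outer", "left_semi"}


def broadcast_side(is_equi, join_type, left_hint, right_hint,
                   left_bytes, right_bytes, threshold=DEFAULT_THRESHOLD):
    """Return "left", "right", or None if Broadcast Hash Join is not eligible."""
    # Mandatory conditions: Equi Join only, never Full Outer.
    if not is_equi or join_type == "full_outer":
        return None
    if left_hint or right_hint:
        # A hint was given: only hinted sides are candidates.
        left_ok, right_ok = left_hint, right_hint
    else:
        # No hint: fall back to the size threshold.
        left_ok = left_bytes <= threshold
        right_ok = right_bytes <= threshold
    can_left = left_ok and join_type in BROADCAST_LEFT_TYPES
    can_right = right_ok and join_type in BROADCAST_RIGHT_TYPES
    if can_left and can_right:
        # Assumption for this sketch: prefer broadcasting the smaller side.
        return "left" if left_bytes <= right_bytes else "right"
    if can_left:
        return "left"
    if can_right:
        return "right"
    return None


print(broadcast_side(True, "inner", False, False, 5 * MB, 500 * MB))
print(broadcast_side(True, "left_outer", False, False, 5 * MB, 500 * MB))
print(broadcast_side(True, "left_outer", False, True, 5 * MB, 500 * MB))
```

The second call shows that a small left data set is not enough on its own: for a 'Left Outer' Join only the right side may be broadcast, so without a hint the mechanism is not eligible.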





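'Shuffle Hash Join'

Mandatory conditions:

  • Applies only to an Equi Join condition
  • Does not apply to the 'Full Outer' Join type
  • 'spark.sql.join.preferSortMergeJoin' (default true) is set to false

In addition, one of the following conditions must hold:

  • A 'shuffle_hash' hint is given on the left input data set, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.
  • No hint is given, but the left input data set is considerably smaller than the right one, and the Join type is 'Right Outer', 'Right Semi', or 'Inner'.
  • A 'shuffle_hash' hint is given on the right input data set, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.
  • No hint is given, but the right input data set is considerably smaller than the left one, and the Join type is 'Left Outer', 'Left Semi', or 'Inner'.
  • A 'shuffle_hash' hint is given on both input data sets, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.
  • No hint is given, but both input data sets are considerably small, and the Join type is 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', or 'Inner'.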





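'Sort Merge Join'

Mandatory conditions:

  • Applies only to an Equi Join condition
  • The Join keys used in the Equi Join condition are sortable
  • 'spark.sql.join.preferSortMergeJoin' (default true) is set to true

In addition, one of the following conditions must hold:

  • A 'merge' hint is given on either input data set, and the Join type can be any.
  • No hint is given, and the Join type can be any.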





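'Cartesian Join'

Mandatory conditions:

  • The Join type is 'Inner'

In addition, one of the following conditions must hold:

  • A 'shuffle_replicate_nl' hint is given on either input data set, and the Join condition can be Equi or Non-Equi.
  • No hint is given, and the Join condition can be Equi or Non-Equi.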





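'Broadcast Nested Loop Join'

'Broadcast Nested Loop Join' is the default Join mechanism: when no other mechanism can be selected, 'Broadcast Nested Loop Join' is used, and it can execute any Join type and any Join condition.

When more than one Join mechanism is eligible, the order of preference is 'Broadcast Hash Join', 'Sort Merge Join', 'Shuffle Hash Join', 'Cartesian Join'.

When choosing between Cartesian and Broadcast Nested Loop Join, Broadcast Nested Loop is preferred over Cartesian Join for Inner, Non-Equi Joins, provided one of the input data sets can be broadcast.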





I hope this article has given you a good understanding of Join in Apache Spark.





