Learning on tabular data. TabNet. Part 1

We would like to present a translation of an interesting article on training neural networks on tabular data. The second part is here.

Abstract

We present TabNet, a new high-performance canonical deep learning architecture for tabular data. TabNet uses sequential attention to choose which features to reason from at each decision step. This makes the model interpretable and the learning more efficient, because the learning capacity is spent on the most salient features. TabNet is shown to outperform other neural network and decision tree variants on a wide range of tabular datasets, and it yields interpretable feature attributions that give insight into the overall behavior of the model. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, which significantly improves performance when a sufficiently large amount of unlabeled data is available.

1. Introduction

Deep neural networks (DNNs) have shown notable success with images [21, 50], text [9, 34] and audio [1, 56]. For these data types, the main driver of progress is the availability of canonical architectures that efficiently encode raw data into meaningful representations, and so deliver high performance on new datasets and related tasks with little extra effort. For example, in image understanding, variants of residual convolutional networks (in particular ResNet [21]) can be expected to perform reasonably well on new image datasets or on related visual recognition problems (for example, classification). The one data type for which a canonical DNN architecture has yet to see such success is tabular data. Although it is the most common data type in real-world AI applications [8], deep learning for tabular data remains under-explored, and variants of ensemble decision trees still dominate most applications [28].

Why is this so? First, tree-based approaches have certain advantages that make them popular: (i) they are representationally efficient (and therefore often high-performing) for decision manifolds with approximately hyperplane boundaries, which are common in tabular data; (ii) they are highly interpretable in their basic form (for example, by tracing decision nodes), and there are effective post-hoc explainability methods for their ensemble form [36], which matters in many real-world applications (for example, in financial services, where trust in a high-risk action is critical); (iii) they are fast to train.

Second, previously proposed DNN architectures are not well suited to tabular data: conventional DNNs built from stacked convolutional layers or multilayer perceptrons (MLPs) are often heavily overparameterized, and the lack of an appropriate inductive bias means they often fail to find optimal solutions for tabular decision manifolds [17].

Why study deep learning for tabular data? One obvious reason is that, as in other domains, performance gains can be expected from DNN-based architectures, especially for large datasets [22]. In addition, unlike tree learning, which does not backpropagate an error signal into its inputs to guide efficient learning, DNNs enable gradient-descent-based end-to-end learning for tabular data. This has many benefits that have been demonstrated in other domains: (i) efficiently encoding multiple data types, such as images, together with tabular data; (ii) reducing or removing the need for feature engineering, which is currently a key aspect of tree-based learning methods for tabular data; (iii) learning from streaming data: tree learning needs global statistics to select split points, and straightforward modifications such as [4] typically yield lower accuracy than training on the full dataset, whereas DNNs show greater potential for continual (lifelong) learning [44]; (iv) end-to-end representation learning, which enables valuable new application scenarios, including data-efficient domain adaptation [17], generative modeling [46] and semi-supervised learning [11].

, , . , ? - TabNet, « » ( ) ( ). , TabNet : . , - , . , : (1) , TabNet ; (2) TabNet , , , , (. . 1); , , , , [6] [61], Tab-Net .

1. TabNet [14]. , . TabNet , . . , , , .

(3) , : (a) TabNet ; (b) TabNet : , , , .

(4) , , (. . 2).

2.

: , , () . , LASSO [20], , , . , [6] , [61] «-» . , TabNet , () , .

: . [18]. , (). – [23], . XGBoost [7] LightGBM [30] - , (Data Science). , , , .

DNN : , [26], . () [33, 58] . , . [60] , . [31] -, , , . [53] - « » (, ), . TabNet , .

: - , [3, 35] . , .

: , , [47]. [13] [55] - .

3. () (). . , ( , ) ReLU , . . C1 C2, - Softmax ( ).

3. TABNET

. (. . 3 ). . , () . TabNet - . , , , :

(i) , ; (ii) , , ; (iii) ; (iv) .

4. ) TabNet , , . , , . , . (b) TabNet, . (c) – 4- , 2 2 . (, Fully-Connected) (Batch Normalization) (Gated Linear Unit). (d) – , , . sparsemax [37] .

. 4 TabNet . . . , (). D-

$f \in \mathbb{R}^{B \times D}$

, B- . TabNet N .

i- (i - 1)- , , . (, [25]) [40] .

, . ( ) , . .

$M[i] \in \mathbb{R}^{B \times D}$

. , , , . , M[i] · f. (. . 4) , , a[i − 1]:

$M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1])) \qquad (1)$

Sparsemax [37] , .
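As an illustration of Equation (1), here is a minimal NumPy sketch of a row-wise sparsemax, following the sort-and-threshold procedure described in [37]. The array `logits` is a hypothetical stand-in for the output of $h_i(a[i-1])$, and `prior` stands in for $P[i-1]$; neither name comes from the article.

```python
import numpy as np

def sparsemax(z):
    """Row-wise sparsemax: Euclidean projection of each row onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    n_rows, n_cols = z.shape
    z_sorted = np.sort(z, axis=1)[:, ::-1]        # each row in descending order
    cumsum = np.cumsum(z_sorted, axis=1)
    k = np.arange(1, n_cols + 1)
    in_support = 1.0 + k * z_sorted > cumsum      # True for the leading k(z) entries
    k_z = in_support.sum(axis=1)                  # support size per row (always >= 1)
    tau = (cumsum[np.arange(n_rows), k_z - 1] - 1.0) / k_z
    return np.maximum(z - tau[:, None], 0.0)

# Assumed toy inputs: B = 2 samples, D = 3 features.
prior = np.ones((2, 3))                           # stand-in for P[i-1]
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.4, 0.1]])              # stand-in for h_i(a[i-1])
mask = sparsemax(prior * logits)                  # Equation (1)
```

In this toy case the first row collapses onto a single feature while the second row stays dense. Every row is non-negative and sums to one, so the mask can be read as a soft feature selection.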

, 1

$\sum_{j=1}^{D} M[i]_{b,j} = 1$

h[i] - , . 4., FC, BN, P[i] - , , :

$P[i] = \prod_{j=1}^{i} (\gamma - M[j]) \qquad (2)$

γ - : γ = 1, γ, . P[0] ,

$\mathbf{1}^{B \times D}$

- . ( ), P[0] , . [19]:

$L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log(M_{b,j}[i] + \epsilon)$

ϵ- . λ . , .
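The interplay between the masks, the prior scale of Equation (2), and the entropy-style sparsity term can be sketched as follows. This is a rough NumPy illustration under stated assumptions, not the authors' implementation: the per-step `logits` are hypothetical stand-ins for $h_i(a[i-1])$, the function names are invented for the example, and the prior is updated incrementally, which matches the product in Equation (2) when $P[0]$ is all ones.

```python
import numpy as np

def sparsemax(z):
    """Row-wise sparsemax (projection onto the probability simplex), as in [37]."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z, axis=1)[:, ::-1]
    cumsum = np.cumsum(z_sorted, axis=1)
    k = np.arange(1, z.shape[1] + 1)
    k_z = (1.0 + k * z_sorted > cumsum).sum(axis=1)
    tau = (cumsum[np.arange(z.shape[0]), k_z - 1] - 1.0) / k_z
    return np.maximum(z - tau[:, None], 0.0)

def attentive_masks(logits_per_step, gamma):
    """Compute M[1..N_steps] while tracking the prior scale P[i]."""
    prior = np.ones_like(logits_per_step[0])      # P[0] = all ones
    masks = []
    for logits in logits_per_step:
        m = sparsemax(prior * logits)             # Equation (1)
        masks.append(m)
        prior = prior * (gamma - m)               # Equation (2), one factor per step
    return masks

def sparsity_loss(masks, eps=1e-15):
    """Entropy of the masks, averaged over steps and batch (L_sparse)."""
    n_steps, batch = len(masks), masks[0].shape[0]
    total = sum((-m * np.log(m + eps)).sum() for m in masks)
    return total / (n_steps * batch)

# Assumed toy input: the same logits at both steps, B = 1, D = 3.
logits = np.array([[3.0, 1.0, 0.5]])
masks = attentive_masks([logits, logits], gamma=1.0)
loss = sparsity_loss(masks)
```

With `gamma=1.0`, the feature fully selected at the first step has its prior driven to zero, so the second step moves its attention to the remaining features. A one-hot mask contributes essentially nothing to the loss, while a spread-out mask is penalized, which is how the regularizer pushes toward sparse selection.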

: (. . 4) ,

$[d[i], a[i]] = f_i(M[i] \cdot f)$, where $d[i] \in \mathbb{R}^{B \times N_d}$ and $a[i] \in \mathbb{R}^{B \times N_a}$

, ( ), , .
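The split of the feature transformer output into $d[i]$ and $a[i]$ can be illustrated with a small NumPy sketch. The transformer here is deliberately simplified and is an assumption of the example: a single fully connected layer followed by a gated linear unit, with batch normalization left out. The weights `W`, the sizes, and the mask values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, N_d, N_a = 4, 6, 3, 2                       # assumed toy sizes

def glu(x):
    """Gated linear unit: first half of the columns gated by a sigmoid of the second half."""
    value, gate = np.split(x, 2, axis=1)
    return value * (1.0 / (1.0 + np.exp(-gate)))

# Hypothetical FC weights; GLU halves the width, so the layer outputs 2 * (N_d + N_a) columns.
W = rng.normal(size=(D, 2 * (N_d + N_a)))

def feature_transformer(masked_features):
    """Simplified stand-in for f_i: FC followed by GLU (no batch normalization)."""
    return glu(masked_features @ W)               # shape (B, N_d + N_a)

f = rng.normal(size=(B, D))                       # input features
M = np.zeros((B, D))
M[:, 0], M[:, 3] = 0.6, 0.4                       # a mask that keeps features 0 and 3 only

out = feature_transformer(M * f)
d, a = out[:, :N_d], out[:, N_d:]                 # d[i]: decision output, a[i]: input to the next step
```

Because the mask multiplies the features before the transformer, columns with a zero mask entry cannot influence either `d` or `a` at this step.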

. 4 . FC BN (GLU) [12], . √0.5 , , [15]. . BN, , , BN [24] BV mB. , , BN. , , . 3,

$d_{out} = \sum_{i=1}^{N_{steps}} \mathrm{ReLU}(d[i])$

$W_{final} d_{out}$

. softmax ( argmax ).
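The aggregation of the per-step decision outputs can be sketched in a few lines of NumPy. The function names and toy values are assumptions of the example. The code uses a row-per-sample layout, so the mapping $W_{final} d_{out}$ is written as `d_out @ W_final` with `W_final` of shape (N_d, number of classes).

```python
import numpy as np

def aggregate_decisions(decision_outputs, W_final):
    """Sum ReLU(d[i]) over the steps, then apply the final linear mapping."""
    d_out = sum(np.maximum(d, 0.0) for d in decision_outputs)
    return d_out @ W_final                        # logits, shape (B, n_classes)

def softmax(logits):
    """Numerically stable row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Assumed toy values: two decision steps, B = 1, N_d = 2, two classes.
d_steps = [np.array([[1.0, -2.0]]), np.array([[-1.0, 3.0]])]
W_final = np.eye(2)                               # hypothetical final weights

logits = aggregate_decisions(d_steps, W_final)
probs = softmax(logits)                           # used during training
predicted = logits.argmax(axis=1)                 # used at inference
```

The ReLU discards negative contributions from each step before they are summed, so in this toy case only the positive entries of each step reach `d_out`.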

TABNet.

, , , .

[1] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, et al. 2015. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 (2015).

[2] AutoML. 2019. AutoML Tables – Google Cloud. https://cloud.google.com/automl-tables/

[3] J. Bao, D. Tang, N. Duan, Z. Yan, M. Zhou, and T. Zhao. 2019. Text Generation From Tables. IEEE Trans Audio, Speech, and Language Processing 27, 2 (Feb 2019), 311–320.

[4] Yael Ben-Haim and Elad Tom-Tov. 2010. A Streaming Parallel Decision Tree Algorithm. JMLR 11 (March 2010), 849–872.

[5] Catboost. 2019. Benchmarks. https://github.com/catboost/benchmarks. Accessed: 2019-11-10.

[6] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. 2018. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. arXiv:1802.07814 (2018).

[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD.

[8] Michael Chui, James Manyika, Mehdi Miremadi, Nicolaus Henke, Rita Chung, et al. 2018. Notes from the AI Frontier. McKinsey Global Institute (4 2018).

[9] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann LeCun. 2016. Very Deep Convolutional Networks for Natural Language Processing. arXiv:1606.01781 (2016).

[10] Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2016. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. arXiv:1607.01097 (2016).

[11] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan Salakhutdinov. 2017. Good Semi-supervised Learning that Requires a Bad GAN. arXiv:1705.09783 (2017).

[12] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language Modeling with Gated Convolutional Networks. arXiv:1612.08083 (2016).

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018).

[14] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

[15] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. arXiv:1705.03122 (2017).

[16] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning 63, 1 (01 Apr 2006), 3–42.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

[18] K. Grabczewski and N. Jankowski. 2005. Feature selection with decision tree criterion. In HIS.

[19] Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised Learning by Entropy Minimization. In NIPS.

[20] Isabelle Guyon and André Elisseeff. 2003. An Introduction to Variable and Feature Selection. JMLR 3 (March 2003), 1157–1182.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 (2015).

[22] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409 (2017).

[23] Tin Kam Ho. 1998. The random subspace method for constructing decision forests. PAMI 20, 8 (Aug 1998), 832–844.

[24] Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 (2017).

[25] Drew A. Hudson and Christopher D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. arXiv:1803.03067 (2018).

[26] K. D. Humbird, J. L. Peterson, and R. G. McClarren. 2018. Deep Neural Network Initialization With Decision Trees. IEEE Trans Neural Networks and Learning Systems (2018).

[27] Mark Ibrahim, Melissa Louie, Ceena Modarres, and John W. Paisley. 2019. Global Explanations of Neural Networks: Mapping the Landscape of Predictions. arXiv:1902.02384 (2019).

[28] Kaggle. 2019. Historical Data Science Trends on Kaggle. https://www.kaggle.com/shivamb/data-science-trends-on-kaggle. Accessed: 2019-04-20.

[29] Kaggle. 2019. Rossmann Store Sales. https://www.kaggle.com/c/rossmann-store-sales. Accessed: 2019-11-10.

[30] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, et al. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In NIPS.

[31] Guolin Ke, Jia Zhang, Zhenhui Xu, Jiang Bian, and Tie-Yan Liu. 2019. TabNN: A Universal Neural Network Solution for Tabular Data. https://openreview.net/forum?id=r1eJssCqY7

[32] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In ICLR.

[33] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. 2015. Deep Neural Decision Forests. In ICCV.

[34] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In AAAI.

[35] Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2017. Table-to-text Generation by Structure-aware Seq2seq Learning. arXiv:1711.09724 (2017).

[36] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv:1802.03888 (2018).

[37] André F. T. Martins and Ramón Fernández Astudillo. 2016. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. arXiv:1602.02068 (2016).

[38] Rory Mitchell, Andrey Adinets, Thejaswi Rao, and Eibe Frank. 2018. XGBoost: Scalable GPU Accelerated Learning. arXiv:1806.11248 (2018).

[39] Decebal Mocanu, Elena Mocanu, Peter Stone, Phuong Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9 (12 2018).

[40] Alex Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, and Danilo J. Rezende. 2019. S3TA: A Soft, Spatial, Sequential, Top-Down Attention Model. https://openreview.net/forum?id=B1gJOoRcYQ

[41] Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. 2017. Exploring Sparsity in Recurrent Neural Networks. arXiv:1704.05119 (2017).

[42] Nbviewer. 2019. Notebook on Nbviewer. https://nbviewer.jupyter.org/github/ dipanjanS/data science for all/blob/master/tds model interpretation xai/ Human-interpretableMachineLearning-DS.ipynb#

[43] N. C. Oza. 2005. Online bagging and boosting. In IEEE Trans Conference on Systems, Man and Cybernetics.

[44] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2018. Continual Lifelong Learning with Neural Networks: A Review. arXiv:1802.07569 (2018).

[45] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In NIPS.

[46] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2015).

[47] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-Taught Learning: Transfer Learning from Unlabeled Data. In ICML.

[48] Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD.

[49] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features Through Propagating Activation Differences. arXiv:1704.02685 (2017).

[50] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014).

[51] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2018. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. arXiv:1810.11921 (2018).

[52] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. arXiv:1703.01365 (2017).

[53] Ryutaro Tanno, Kai Arulkumaran, Daniel C. Alexander, Antonio Criminisi, and Aditya V. Nori. 2018. Adaptive Neural Trees. arXiv:1807.06699 (2018).

[54] Tensorflow. 2019. Classifying Higgs boson processes in the HIGGS Data Set. https://github.com/tensorflow/models/tree/master/official/boosted_trees

[55] Trieu H. Trinh, Minh-Thang Luong, and Quoc V. Le. 2019. Selfie: Self-supervised Pretraining for Image Embedding. arXiv:1906.02940 (2019).

[56] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, et al. 2016. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 (2016).

[57] Sethu Vijayakumar and Stefan Schaal. 2000. Locally Weighted Projection Regression: An O(n) Algorithm for Incremental Real Time Learning in High Dimensional Space. In ICML.

[58] Suhang Wang, Charu Aggarwal, and Huan Liu. 2017. Using a random forest to inspire a neural network and improving on it. In SDM.

[59] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning Structured Sparsity in Deep Neural Networks. arXiv:1608.03665 (2016).

[60] Yongxin Yang, Irene Garcia Morillo, and Timothy M. Hospedales. 2018. Deep Neural Decision Trees. arXiv:1806.06988 (2018).

[61] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2019. INVASE: Instancewise Variable Selection using Neural Networks. In ICLR.
