🌭 🤶🏼 👨🏼‍🔧 Como uma tarefa da coleta de dados clássica, mudou-se para a solução de um problema MNIST simples. Ou como analisei o site do CEC ☑️ 👁‍🗨 🙁

Um dia da semana, à noite, um problema interessante veio do meu chefe. Chega um link com o texto: “Quero tirar tudo daqui, mas tem uma nuance”. Após 2 horas, diga-me o que você pensa sobre como resolver o problema. A hora é 16:00.

Este artigo tratará dessa nuance.

Eu começo o selênio como de costume, e após o primeiro clique no link onde a tabela necessária com os resultados das eleições da República do Tartaristão está localizada, ele falha

Como você entende, a nuance está no fato de que após cada clique no link, aparece um captcha.

Após análise da estrutura do site, constatou-se que o número de links chega a cerca de 30 mil.

Não tive escolha a não ser pesquisar na Internet por maneiras de reconhecer o captcha. Um serviço encontrado

+ Captcha é 100% reconhecido, assim como uma pessoa

- O tempo médio de reconhecimento é de 9 segundos, o que é muito longo, já que temos cerca de 30 mil links diferentes, que precisamos seguir e reconhecer o captcha.

Desisti imediatamente dessa ideia. Depois de várias tentativas de obter o captcha, percebi que ele não muda muito, todos os mesmos números pretos sobre um fundo verde.

E como há muito desejo tocar o "computador de visão" com minhas mãos, decidi que tinha uma grande chance de tentar sozinho o problema MNIST favorito de todos.

Já eram 17 horas e comecei a procurar modelos pré-treinados para reconhecer números. Depois de verificá-los neste captcha, a precisão não me satisfez - bem, é hora de coletar imagens e treinar sua rede neural.

Primeiro, você precisa coletar uma amostra de treinamento.

Abro o driver da web do Chrome e exibo 1000 captchas na minha pasta.

from selenium import webdriver
i = 1000
driver = webdriver.Chrome('/Users/aleksejkudrasov/Downloads/chromedriver')
while i>0:
    driver.get('http://www.vybory.izbirkom.ru/region/izbirkom?action=show&vrn=4274007421995&region=27&prver=0&pronetvd=0')
    time.sleep(0.5)
    with open(str(i)+'.png', 'wb') as file:
        file.write(driver.find_element_by_xpath('//*[@id="captchaImg"]').screenshot_as_png)
    i = i - 1

Como temos apenas duas cores, converti nossos captchas em bw:

from operator import itemgetter, attrgetter
from PIL import Image
import glob
list_img = glob.glob('path/*.png')

for img in list_img:
    im = Image.open(img)
    im = im.convert("P")
    im2 = Image.new("P",im.size,255)

    im = im.convert("P")

    temp = {}
#        
    for x in range(im.size[1]):
        for y in range(im.size[0]):
            pix = im.getpixel((y,x))
            temp[pix] = pix
            if pix != 0: 
                im2.putpixel((y,x),0)

    im2.save(img)

Agora precisamos cortar nossos captchas em números e convertê-los em um único tamanho de 10 * 10.

Primeiro, cortamos o captcha em números e, em seguida, como o captcha é deslocado ao longo do eixo OY, precisamos cortar todos os desnecessários e girar a imagem em 90 °.


def crop(im2):
    inletter = False
    foundletter = False
    start = 0
    end = 0
    count = 0
    letters = []
    name_slise=0
    for y in range(im2.size[0]): 
        for x in range(im2.size[1]): 
            pix = im2.getpixel((y,x))
            if pix != 255:
                inletter = True
#       OX
        if foundletter == False and inletter == True: 
            foundletter = True
            start = y
#       OX 
        if foundletter == True and inletter == False: 
            foundletter = False
            end = y
            letters.append((start,end))

        inletter = False

    for letter in letters:
#   
        im3 = im2.crop(( letter[0] , 0, letter[1],im2.size[1] )) 
#  90°
        im3 = im3.transpose(Image.ROTATE_90) 

        letters1 = []
#  
        for y in range(im3.size[0]): # slice across
            for x in range(im3.size[1]): # slice down
                pix = im3.getpixel((y,x))
                if pix != 255:
                    inletter = True
            if foundletter == False and inletter == True:
                foundletter = True
                start = y

            if foundletter == True and inletter == False:
                foundletter = False
                end = y
                letters1.append((start,end))

            inletter=False

        for letter in letters1:
#  
            im4 = im3.crop(( letter[0] , 0, letter[1],im3.size[1] )) 
#     
        im4 = im4.transpose(Image.ROTATE_270) 
        resized_img = im4.resize((10, 10), Image.ANTIALIAS)
        resized_img.save(path+name_slise+'.png')
        name_slise+=1

“Já é hora, 18:00, é hora de acabar com esse problema”, pensei, espalhando os números ao longo do caminho em pastas com seus números.

Declaramos um modelo simples que aceita uma matriz expandida de nossa imagem como entrada.

Para fazer isso, crie uma camada de entrada de 100 neurônios, já que o tamanho da imagem é 10 * 10. Como uma camada de saída, existem 10 neurônios, cada um correspondendo a um dígito de 0 a 9.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Activation, BatchNormalization, AveragePooling2D
from tensorflow.keras.optimizers import SGD, RMSprop, Adam
def mnist_make_model(image_w: int, image_h: int):
    # Neural network model
    model = Sequential()
    model.add(Dense(image_w*image_h, activation='relu', input_shape=(image_h*image_h)))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
    return model

Dividimos nossos dados em conjuntos de treinamento e teste:


list_folder = ['0','1','2','3','4','5','6','7','8','9']
X_Digit = []
y_digit = []
for folder in list_folder:
    for name in glob.glob('path'+folder+'/*.png'):
        im2 = Image.open(name)
        X_Digit.append(np.array(im2))
        y_digit.append(folder)

Dividimos em treinamento e conjunto de teste:


from sklearn.model_selection import train_test_split

X_Digit = np.array(X_Digit) 
y_digit = np.array(y_digit)
X_train, X_test, y_train, y_test = train_test_split(X_Digit, y_digit, test_size=0.15, random_state=42)
train_data = X_train.reshape(X_train.shape[0], 10*10) #    100
test_data = X_test.reshape(X_test.shape[0], 10*10) #    100
#      10
num_classes = 10
train_labels_cat = keras.utils.to_categorical(y_train, num_classes)
test_labels_cat = keras.utils.to_categorical(y_test, num_classes)

Treinamos o modelo.

Selecione empiricamente os parâmetros do número de épocas e o tamanho do "lote":


model = mnist_make_model(10,10)
model.fit(train_data, train_labels_cat, epochs=20, batch_size=32, verbose=1, validation_data=(test_data, test_labels_cat))

Economizamos os pesos:


model.save_weights("model.h5")

A precisão na 11ª época revelou-se excelente: precisão = 1,0000. Satisfeito, vou para casa descansar às 19:00, amanhã ainda preciso escrever um parser para coletar informações do site do CEC.

Manhã no dia seguinte.

A questão continua pequena, resta percorrer todas as páginas do site do CEC e coletar os dados:

Carregue os pesos do modelo treinado:


model = mnist_make_model(10,10)
model.load_weights('model.h5')

Escrevemos uma função para salvar o captcha:


def get_captcha(driver):
    with open('snt.png', 'wb') as file:
        file.write(driver.find_element_by_xpath('//*[@id="captchaImg"]').screenshot_as_png)
    im2 = Image.open('path/snt.png')
    return im2

Vamos escrever uma função para previsão de captcha:


def crop_predict(im):
    list_cap = []
    im = im.convert("P")
    im2 = Image.new("P",im.size,255)

    im = im.convert("P")

    temp = {}

    for x in range(im.size[1]):
        for y in range(im.size[0]):
            pix = im.getpixel((y,x))
            temp[pix] = pix
            if pix != 0:
                im2.putpixel((y,x),0)
    

    inletter = False
    foundletter=False
    start = 0
    end = 0
    count = 0
    letters = []
    for y in range(im2.size[0]): 
        for x in range(im2.size[1]): 
            pix = im2.getpixel((y,x))
            if pix != 255:
                inletter = True
        if foundletter == False and inletter == True:
            foundletter = True
            start = y

        if foundletter == True and inletter == False:
            foundletter = False
            end = y
            letters.append((start,end))

        inletter=False

    for letter in letters:
        im3 = im2.crop(( letter[0] , 0, letter[1],im2.size[1] ))
        im3 = im3.transpose(Image.ROTATE_90)

        letters1 = []

        for y in range(im3.size[0]):
            for x in range(im3.size[1]):
                pix = im3.getpixel((y,x))
                if pix != 255:
                    inletter = True
            if foundletter == False and inletter == True:
                foundletter = True
                start = y

            if foundletter == True and inletter == False:
                foundletter = False
                end = y
                letters1.append((start,end))

            inletter=False

        for letter in letters1:
            im4 = im3.crop(( letter[0] , 0, letter[1],im3.size[1] ))
        im4 = im4.transpose(Image.ROTATE_270)
        resized_img = im4.resize((10, 10), Image.ANTIALIAS)
        img_arr = np.array(resized_img)/255
        img_arr = img_arr.reshape((1, 10*10))
        list_cap.append(model.predict_classes([img_arr])[0])
    return ''.join([str(elem) for elem in list_cap])

Adicione uma função que faça download da tabela:


def get_table(driver):
    html = driver.page_source #   
    soup = BeautifulSoup(html, 'html.parser') #  " "
    table_result = [] #       
    tbody = soup.find_all('tbody') #   
    list_tr = tbody[1].find_all('tr') #   
    ful_name = list_tr[0].text #  
    for table in list_tr[3].find_all('table'): #   
        if len(table.find_all('tr'))>5: #  
            for tr in table.find_all('tr'): #   
                snt_tr = []#  
                for td in tr.find_all('td'):
                    snt_tr.append(td.text.strip())#    
                table_result.append(snt_tr)# 
    return (ful_name, pd.DataFrame(table_result, columns = ['index', 'name','count']))

Coletamos todos os links para 13 de setembro:


df_table = []
driver.get('http://www.vybory.izbirkom.ru')
driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[2]/td/center/table/tbody/tr[2]/td/div/table/tbody/tr[3]/td[3]').click()
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
list_a = soup.find_all('table')[1].find_all('a')
for a in list_a:
    name = a.text
    link = a['href']
    df_table.append([name,link])
df_table = pd.DataFrame(df_table, columns = ['name','link'])

Às 13h, termino de escrever o código com uma passagem por todas as páginas:


result_df = []
for index, line in df_table.iterrows():#     
    driver.get(line['link'])# 
    time.sleep(0.6)
    try:#    
        captcha = crop(get_captcha(driver))
        driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
        driver.find_element_by_xpath('//*[@id="send"]').click()
        time.sleep(0.6)
        true_cap(driver)
    except NoSuchElementException:#     
        pass
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    if soup.find('select') is None:#      
        time.sleep(0.6)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')          
        for i in range(len(soup.find_all('tr'))):#    
            if '\n \n' == soup.find_all('tr')[i].text:# ,          
                rez_link = soup.find_all('tr')[i+1].find('a')['href']
        driver.get(rez_link)
        time.sleep(0.6)
        try:
            captcha = crop(get_captcha(driver))
            driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
            driver.find_element_by_xpath('//*[@id="send"]').click()
            time.sleep(0.6)
            true_cap(driver)
        except NoSuchElementException:
            pass
        ful_name , table = get_table(driver)# 
        head_name = line['name']
        child_name = ''
        result_df.append([line['name'],line['link'],rez_link,head_name,child_name,ful_name,table])
    else:#   ,   
        options = soup.find('select').find_all('option')
        for option in options:
            if option.text == '---':#     
                continue
            else:
                link = option['value']
                head_name = option.text
                driver.get(link)
                try:
                    time.sleep(0.6)
                    captcha = crop(get_captcha(driver))
                    driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
                    driver.find_element_by_xpath('//*[@id="send"]').click()
                    time.sleep(0.6)
                    true_cap(driver)
                except NoSuchElementException:
                    pass
                html2 = driver.page_source
                second_soup = BeautifulSoup(html2, 'html.parser')
                for i in range(len(second_soup.find_all('tr'))):
                    if '\n \n' == second_soup.find_all('tr')[i].text:
                        rez_link = second_soup.find_all('tr')[i+1].find('a')['href']
                driver.get(rez_link)
                try:
                    time.sleep(0.6)
                    captcha = crop(get_captcha(driver))
                    driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
                    driver.find_element_by_xpath('//*[@id="send"]').click()
                    time.sleep(0.6)
                    true_cap(driver)
                except NoSuchElementException:
                    pass
                ful_name , table = get_table(driver)
                child_name = ''
                result_df.append([line['name'],line['link'],rez_link,head_name,child_name,ful_name,table])
                if second_soup.find('select') is None:
                    continue
                else:
                    options_2 = second_soup.find('select').find_all('option')
                    for option_2 in options_2:
                        if option_2.text == '---':
                            continue
                        else:
                            link_2 = option_2['value']
                            child_name = option_2.text
                            driver.get(link_2)
                            try:
                                time.sleep(0.6)
                                captcha = crop(get_captcha(driver))
                                driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
                                driver.find_element_by_xpath('//*[@id="send"]').click()
                                time.sleep(0.6)
                                true_cap(driver)
                            except NoSuchElementException:
                                pass
                            html3 = driver.page_source
                            thrid_soup = BeautifulSoup(html3, 'html.parser')
                            for i in range(len(thrid_soup.find_all('tr'))):
                                if '\n \n' == thrid_soup.find_all('tr')[i].text:
                                    rez_link = thrid_soup.find_all('tr')[i+1].find('a')['href']
                            driver.get(rez_link)
                            try:
                                time.sleep(0.6)
                                captcha = crop(get_captcha(driver))
                                driver.find_element_by_xpath('//*[@id="captcha"]').send_keys(captcha)
                                driver.find_element_by_xpath('//*[@id="send"]').click()
                                time.sleep(0.6)
                                true_cap(driver)
                            except NoSuchElementException:
                                pass
                            ful_name , table = get_table(driver)
                            result_df.append([line['name'],line['link'],rez_link,head_name,child_name,ful_name,table])

E então vem o tweet que mudou minha vida

Como uma tarefa da coleta de dados clássica, mudou-se para a solução de um problema MNIST simples. Ou como analisei o site do CEC

More articles: