Collecting training data for NLP tasks

Choosing a source and implementation tools

As the source of information I decided to use habr.com, a collective blog with elements of a news site (it publishes news, analytical articles, and articles on information technology, business, the Internet, and so on). All materials on this resource are divided into categories (hubs); there are 416 main ones alone. Each material can belong to one or more categories.

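The code is written in Python, in a Jupyter notebook on Google Colab. The main libraries used:

BeautifulSoup - parsing HTML / XML;
Requests - sending HTTP requests;
re - regular expressions;
Pandas - storing and processing the collected data.

tqdm and ratelim are also used (to show a progress bar and to limit the request rate).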

First, the base URL of the posts and the number of posts to collect are defined:

mainUrl = 'https://habr.com/ru/post/'
postCount = 10000

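The function that downloads a single post builds its URL from the post number and wraps the request in try… except to handle the errors raised by requests: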

@ratelim.patient(1, 1)
def get_post(postNum):
    currPostUrl = mainUrl + str(postNum)
    try:
        response = requests.get(currPostUrl)
        # Raise HTTPError for 4xx / 5xx responses (for example, a missing post)
        response.raise_for_status()
        response_title, response_post, response_numComment, response_rating, response_ratingUp, response_ratingDown, response_bookMark, response_views = executePost(response)
        dataList = [postNum, currPostUrl, response_title, response_post, response_numComment, response_rating, response_ratingUp, response_ratingDown, response_bookMark, response_views]
        # habrParse_df is a pandas DataFrame that must already exist at this point
        habrParse_df.loc[len(habrParse_df)] = dataList
    except requests.exceptions.HTTPError:
        # Skip posts that could not be retrieved
        pass

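The ratelim decorator limits how often the function is called, so that the site is not flooded with requests. If the request inside try fails, for example because a post with that number does not exist, the post is skipped.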
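To make the throttling concrete, here is a minimal standard-library sketch of a blocking rate limiter in the spirit of the decorator above. It assumes that ratelim.patient(1, 1) means "at most 1 call per 1 second, waiting rather than failing"; the name patient_limit and its implementation are hypothetical and are not ratelim's own code.

```python
import time
from collections import deque
from functools import wraps


def patient_limit(max_calls, period):
    """Allow at most max_calls calls per period seconds, sleeping when over the limit.

    Hypothetical stand-in for illustration; not the ratelim implementation.
    """
    def decorator(func):
        calls = deque()  # start times of the most recent calls

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Forget calls that have already left the time window
            while calls and now - calls[0] >= period:
                calls.popleft()
            if len(calls) >= max_calls:
                # Wait until the oldest remaining call leaves the window
                time.sleep(period - (now - calls[0]))
                calls.popleft()
            calls.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator


@patient_limit(1, 0.2)  # short period so that the demo finishes quickly
def fetch(n):
    return n * 2


start = time.monotonic()
results = [fetch(i) for i in range(3)]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

With a decorator like @patient_limit(1, 1) on get_post, consecutive requests would be spaced roughly one second apart, which is the behavior the scraper relies on.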

executePost is the function that extracts the required data from the page.

def executePost(page):
    soup = bs(page.text, 'html.parser')
    # Post title, taken from the og:title meta tag
    title = soup.find('meta', property='og:title')
    title = title['content']
    # Post body (kept as HTML), with line breaks replaced by spaces
    post = str(soup.find('div', id="post-content-body"))
    post = re.sub('\n', ' ', post)
    # Number of comments
    num_comment = soup.find('span', id='comments_count').text
    num_comment = int(re.sub('\n', '', num_comment).strip())
    # Info panel with the post statistics
    info_panel = soup.find('ul', attrs={'class' : 'post-stats post-stats_post js-user_'})
    # Post rating: the counter carries a different class for positive and negative values
    try:
        rating = int(info_panel.find('span', attrs={'class' : 'voting-wjt__counter js-score'}).text)
    except (AttributeError, ValueError):
        rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_positive js-score'})
        if rating:
            rating = int(re.sub(r'\+', '', rating.text))
        else:
            rating = info_panel.find('span', attrs={'class' : 'voting-wjt__counter voting-wjt__counter_negative js-score'}).text
            rating = - int(re.sub('–', '', rating))
    # Upvotes and downvotes: the numbers after ↑ and ↓ in the title attribute of the first span
    vote = info_panel.find_all('span')[0].attrs['title']
    rating_upVote = int(re.search(r'↑(\d+)', vote).group(1))
    rating_downVote = int(re.search(r'↓(\d+)', vote).group(1))
    # Number of bookmarks
    bookmk = int(info_panel.find_all('span')[1].text)
    # Number of views (kept as text)
    views = info_panel.find_all('span')[3].text
    return title, post, num_comment, rating, rating_upVote, rating_downVote, bookmk, views

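The page is parsed with BeautifulSoup: soup = bs(page.text, 'html.parser'). The required elements are then located with find / find_all (by tag name and HTML attributes). Which tags and attributes to search for is determined by examining the HTML code of the page.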
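As a small illustration of find, the sketch below parses a hand-written HTML snippet. The markup is invented for the example and only imitates some of the tags and attributes that executePost targets; it is not a real habr.com page.

```python
from bs4 import BeautifulSoup as bs

# Invented markup that imitates the elements executePost looks for
html = """
<html><head>
<meta property="og:title" content="Sample post title">
</head><body>
<div id="post-content-body">First line
second line</div>
<span id="comments_count">
  12
</span>
</body></html>
"""

soup = bs(html, 'html.parser')

# find returns the first matching tag (or None); attributes are read like dict keys
title = soup.find('meta', property='og:title')['content']

# get_text returns only the text inside the tag, without the markup
post = soup.find('div', id='post-content-body').get_text().replace('\n', ' ')

# The text of the span still contains whitespace, so strip it before converting
num_comment = int(soup.find('span', id='comments_count').text.strip())

print(title, '|', post, '|', num_comment)
```

Note that executePost stores the post body with its HTML tags (str of the found tag); get_text is shown here as one possible alternative when only the plain text is needed.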

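The function is then called in a loop over the post numbers, here for the first 10 thousand. tqdm shows the progress of the loop.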

for pc in tqdm(range(postCount)):
    postNum = pc + 1
    get_post(postNum)

The collected data is stored in a pandas DataFrame.

As a result, I obtained a dataset containing the texts of articles from habr.com, along with additional information: the title, the link to the article, the number of comments, the rating, the number of bookmarks, and the number of views.
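A minimal sketch of how such a table could be set up and saved, assuming pandas. The column names and the file name habr_posts.csv are hypothetical (the article does not list them); the columns simply follow the order of dataList in get_post, and the sample row is made-up data.

```python
import pandas as pd

# Hypothetical column names, in the same order as dataList in get_post
columns = ['postNum', 'url', 'title', 'post', 'numComment', 'rating',
           'ratingUp', 'ratingDown', 'bookMark', 'views']
habrParse_df = pd.DataFrame(columns=columns)

# Made-up row, appended the same way get_post does it
habrParse_df.loc[len(habrParse_df)] = [
    1, 'https://habr.com/ru/post/1', 'Sample title', '<div>Sample text</div>',
    3, 10, 12, 2, 5, '1,2k']

# Save to CSV so that the data survives the end of the notebook session
habrParse_df.to_csv('habr_posts.csv', index=False)

# Read the file back to check that the table round-trips
restored = pd.read_csv('habr_posts.csv')
print(restored.shape)
```

Creating the empty DataFrame before the loop is what lets get_post append rows with habrParse_df.loc[len(habrParse_df)].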

In the future, the resulting dataset can be enriched with additional data and used for training when building various language models, classifying texts, and so on.
