Clustering and classification of Big Text Data with Java machine learning. Article #2: Algorithms

Hello, Habr! Today we continue the topic of clustering and classifying big text data using machine learning in Java. This article is a continuation of the first article.

The article covers the theory and the implementation of the algorithms I used.

1. Tokenization

Theory:

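import java.io.BufferedReader;
import java.io.IOError;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordIterator implements Iterator<String> {
    // Assumed definition (not shown in the excerpt): a token is a run of non-whitespace characters.
    private static final Pattern notWhiteSpace = Pattern.compile("\\S+");
    private final BufferedReader br;
    private Matcher matcher;
    private String curLine;
    private String next;   // look-ahead token; null once the input is exhausted

    public WordIterator(BufferedReader br) {
        this.br = br;
        curLine = null;
        advance();
    }

    // Moves the look-ahead to the next token, reading new lines as needed.
    private void advance() {
        try {
            while (true) {
                if (curLine == null || !matcher.find()) {
                    String line = br.readLine();
                    if (line == null) {          // end of input
                        next = null;
                        br.close();
                        return;
                    }
                    matcher = notWhiteSpace.matcher(line);
                    curLine = line;
                    if (!matcher.find())
                        continue;                // blank line: read the next one
                }
                next = curLine.substring(matcher.start(), matcher.end());
                break;
            }
        } catch (IOException ioe) {
            throw new IOError(ioe);
        }
    }

    // hasNext() and next() are not part of the excerpt; this is the usual look-ahead pattern.
    public boolean hasNext() { return next != null; }

    public String next() {
        if (next == null) throw new NoSuchElementException();
        String token = next;
        advance();
        return token;
    }
}

// Usage: wrap a BufferedReader and iterate over its whitespace-delimited tokens.
Iterator<String> finalIterator = new WordIterator(reader);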

2. Stop words

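Methods for building a stop list:

Frequency-based methods: removing the most frequent terms (TF-High), removing terms that occur only once (TF1), and removing terms with a low inverse document frequency (IDF).
Mutual information (MI): terms that share little mutual information with a document class have low discriminating power and are candidates for removal.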

Term-based random sampling (TBRS): a method that detects uninformative words in the documents automatically. It iterates over separate chunks of randomly selected data and ranks the terms in each chunk by their informativeness, measured with the Kullback-Leibler divergence:

$$ d_x(t) = P_x(t) \cdot \log_2 \frac{P_x(t)}{P(t)} $$

where P_x(t) is the normalized frequency of term t within chunk x, and

P(t) is the normalized frequency of term t in the whole collection.
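A minimal Java sketch of the weight d_x(t) defined above, assuming a chunk and the collection are both given as lists of tokens. The class and method names (KlTermWeight, normalizedFrequencies, weights) are hypothetical and are not taken from the article's code; the sketch also assumes that every term of the chunk occurs in the collection.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KlTermWeight {

    /** Normalized term frequencies: count(t) divided by the total number of tokens. */
    static Map<String, Double> normalizedFrequencies(List<String> tokens) {
        Map<String, Double> freq = new HashMap<>();
        for (String t : tokens)
            freq.merge(t, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : freq.entrySet())
            e.setValue(e.getValue() / tokens.size());
        return freq;
    }

    /** d_x(t) = P_x(t) * log2(P_x(t) / P(t)) for every term t of chunk x. */
    static Map<String, Double> weights(List<String> chunk, List<String> collection) {
        Map<String, Double> px = normalizedFrequencies(chunk);
        Map<String, Double> p = normalizedFrequencies(collection);
        Map<String, Double> d = new HashMap<>();
        for (Map.Entry<String, Double> e : px.entrySet()) {
            double pxT = e.getValue();
            double pT = p.get(e.getKey());   // assumption: the chunk is drawn from the collection
            d.put(e.getKey(), pxT * (Math.log(pxT / pT) / Math.log(2)));
        }
        return d;
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList(
            "the", "cat", "the", "dog", "the", "fish", "the", "bird");
        List<String> chunk = collection.subList(0, 4);   // "the cat the dog"
        System.out.println(weights(chunk, collection));
    }
}
```

In this sample, "the" has the same normalized frequency in the chunk and in the collection, so its weight is 0, which makes it a stop-word candidate; "cat" is twice as frequent in the chunk and gets a weight of 0.25.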

The final stop list is built by taking the least informative terms across all documents and removing any duplicates.
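For comparison, the simpler frequency-based ideas named earlier (TF-High and TF1) can be sketched in a few lines. This is an illustration under assumptions, not the article's implementation: the class FrequencyStopList and the cut-off parameter topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FrequencyStopList {

    /** Counts how often each token occurs in the whole collection. */
    static Map<String, Integer> countTerms(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs)
            for (String token : doc)
                counts.merge(token, 1, Integer::sum);
        return counts;
    }

    /**
     * Builds a stop list from the topK most frequent terms (TF-High)
     * plus every term that occurs exactly once (TF1).
     * topK is a hypothetical tuning parameter.
     */
    static Set<String> build(List<List<String>> docs, int topK) {
        Map<String, Integer> counts = countTerms(docs);
        List<Map.Entry<String, Integer>> byFreq = new ArrayList<>(counts.entrySet());
        // Sort by descending frequency; ties are broken alphabetically for determinism.
        byFreq.sort((a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        Set<String> stop = new TreeSet<>();
        for (int i = 0; i < Math.min(topK, byFreq.size()); i++)
            stop.add(byFreq.get(i).getKey());          // TF-High
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() == 1)
                stop.add(e.getKey());                  // TF1
        return stop;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat", "on", "the", "mat"),
            Arrays.asList("the", "dog", "sat", "on", "the", "log"));
        System.out.println(build(docs, 1));   // [cat, dog, log, mat, the]
    }
}
```

With topK = 1 the most frequent term ("the") and the four singletons end up in the stop list, while "sat" and "on", which occur twice, are kept.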

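Code:

import java.io.BufferedReader;
import java.io.IOError;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class TokenFilter {
    private Set<String> tokens;
    private boolean excludeTokens;
    private TokenFilter parent;

    // Loads one stop word per line from a classpath resource (assumed to be UTF-8).
    public TokenFilter loadFromResource(String fileName) {
        ClassLoader classLoader = getClass().getClassLoader();
        try (InputStream is = classLoader.getResourceAsStream(fileName);
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(is, StandardCharsets.UTF_8))) {
            Set<String> words = new HashSet<String>();
            for (String line; (line = br.readLine()) != null;)
                words.add(line.trim());

            this.tokens = words;
            this.excludeTokens = true;   // the listed words are rejected
            this.parent = null;
        } catch (Exception e) {
            // Also covers a missing resource, which surfaces as a NullPointerException.
            throw new IOError(e);
        }
        return this;
    }

    public boolean accept(String token) {
        // Normalize: lower-case, then drop dots, spaces and digits.
        token = token.toLowerCase().replaceAll("[\\. \\d]", "");
        return (parent == null || parent.accept(token))
                && (tokens.contains(token) ^ excludeTokens)
                && token.length() > 2
                // The original character class was lost in extraction; any Unicode letter is assumed.
                && token.matches("\\p{L}+");
    }
}

// Usage inside the token loop: skip every token the filter rejects.
TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;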

File (stopwords.txt, one stop word per line):

....

3. Lemmatization

Theory:

Lemmatization reduces the inflected forms of a word to a single base form, the lemma. For example, working, works, and work all reduce to work, and computers, computing, and computer reduce to compute.

Over the years, several tools have been developed that provide lemmatization. Despite their different processing methods, all of them rely on a lexicon of words, a set of rules, or a combination of the two as resources for morphological analysis. The best-known lemmatization tools are:

WordNet
CLEAR
GENIA
TreeTagger
Norm and LuiNorm
MorphAdorner
morpha

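import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.List;
import java.util.Properties;

// Build the pipeline once and reuse it; creating it is expensive.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Keep only Latin letters and lower-case the token before lemmatizing it.
String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase();
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class);  // lemma (base form) of the token
}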

4. Term frequency - inverse document frequency

Term frequency - inverse document frequency (TF-IDF) is the most widely used algorithm for computing the weight of a term (a keyword in a document) in modern information retrieval systems. This weight is a statistical measure of how important a word is to a document within a collection of documents or a corpus. The value increases in proportion to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.

...

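Term frequency (TF) measures how often a term occurs in a document. The simplest choice is the raw count of term t in document D:

$$ tf(t,D) = f_{t,D}, $$

where f_{t,D} is the number of times t occurs in D.

Other variants:

Boolean frequency: tf(t,D) = 1 if t occurs in D and 0 otherwise;

term frequency adjusted for document length:

$$ tf(t,D) = \frac{f_{t,D}}{\sum_{t' \in D} f_{t',D}} $$

logarithmically scaled frequency:

$$ tf(t,D) = \log(1 + f_{t,D}) $$

augmented frequency, which divides the raw count by that of the most frequent term in the document to avoid a bias towards longer documents:

$$ tf(t,D) = 0.5 + 0.5 \cdot \frac{f_{t,D}}{\max\{f_{t',D} : t' \in D\}} $$

Inverse document frequency (IDF) measures how much information a term carries: it is high for terms that occur in few documents and low for terms that occur in many. It is the logarithm of the total number of documents N divided by the number of documents that contain the term (in this formula D denotes the whole collection):

$$ idf(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|} $$

TF-IDF combines TF and IDF. A term gets a high weight when it is frequent in the given document and rare in the collection as a whole, so common terms are filtered out. TF-IDF is computed as the product:

$$ tfidf(t,D) = tf(t,D) \cdot idf(t,D) $$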
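To make the formulas concrete, here is a small self-contained sketch that uses the length-adjusted TF and the plain logarithmic IDF. The class TfIdfSketch and its method names are hypothetical; the sketch uses the natural logarithm and assumes the term occurs in at least one document, so no smoothing is applied.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {

    /** Raw count f_{t,D}: how many times term t occurs in document D. */
    static int rawCount(String t, List<String> doc) {
        int n = 0;
        for (String token : doc)
            if (token.equals(t)) n++;
        return n;
    }

    /** Length-adjusted TF: f_{t,D} divided by the number of tokens in D. */
    static double tf(String t, List<String> doc) {
        return (double) rawCount(t, doc) / doc.size();
    }

    /** IDF = log(N / number of documents that contain t). */
    static double idf(String t, List<List<String>> corpus) {
        int containing = 0;
        for (List<String> doc : corpus)
            if (doc.contains(t)) containing++;
        // Assumption: t occurs somewhere in the corpus, so containing > 0.
        return Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(String t, List<String> doc, List<List<String>> corpus) {
        return tf(t, doc) * idf(t, corpus);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("java", "text", "text", "mining");
        List<String> d2 = Arrays.asList("java", "cluster");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        System.out.println("java: " + tfIdf("java", d1, corpus));
        System.out.println("text: " + tfIdf("text", d1, corpus));
    }
}
```

Here "java" occurs in both documents, so its IDF is log(2/2) = 0 and its weight vanishes, while "text" occurs only in the first document and gets the weight 0.5 · log 2.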

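// Trove map from an object to its count; get() returns 0 (the default no-entry value) for unseen keys.
private final TObjectIntMap<T> counts;
// Total number of counted occurrences (declared elsewhere in the original class).
private int sum;

// Increments the count of obj and returns the new count.
public int count(T obj) {
    int count = counts.get(obj);
    count++;
    counts.put(obj, count);
    sum++;
    return count;
}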

public synchronized int addColumn(SparseArray<? extends Number> column) {
     if (column.length() > numRows)
         numRows = column.length();
    
     int[] nonZero = column.getElementIndices();
     nonZeroValues += nonZero.length;
     try {
         matrixDos.writeInt(nonZero.length);
         for (int i : nonZero) {
             matrixDos.writeInt(i); // write the row index
             matrixDos.writeFloat(column.get(i).floatValue());
         }
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
     return ++curCol;
}

public interface SparseArray<T> {
    int cardinality();
    T get(int index);
    int[] getElementIndices();
    int length();
    void set(int index, T obj);
    <E> E[] toArray(E[] array);
}

// Reads a dense binary matrix (row count, column count, then the float values
// row by row) and writes a copy in which every value has been transformed.
public File transform(File inputFile, File outFile, GlobalTransform transform) {
     // try-with-resources closes both streams even if an exception is thrown
     try (DataInputStream dis = new DataInputStream(
              new BufferedInputStream(new FileInputStream(inputFile)));
          DataOutputStream dos = new DataOutputStream(
              new BufferedOutputStream(new FileOutputStream(outFile)))) {
         int rows = dis.readInt();
         int cols = dis.readInt();
         dos.writeInt(rows);
         dos.writeInt(cols);
         for (int row = 0; row < rows; ++row) {
             for (int col = 0; col < cols; ++col) {
                 double val = dis.readFloat();
                 dos.writeFloat((float) transform.transform(row, col, val));
             }
         }
         return outFile;
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
}

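// TF-IDF weight of one matrix cell: rows are terms, columns are documents.
public double transform(int row, int column, double value) {
        // term frequency: occurrences of the term divided by the document's term count
        double tf = value / docTermCount[column];
        // inverse document frequency; the +1 avoids division by zero, and the
        // cast guards against integer division if the counts are integers
        double idf = Math.log((double) totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
}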

public void factorize(MatrixFile mFile, int dimensions) {
        try {
            String formatString = "";
            switch (mFile.getFormat()) {
            case SVDLIBC_DENSE_BINARY:
                formatString = " -r db ";
                break;
            case SVDLIBC_DENSE_TEXT:
                formatString = " -r dt ";
                break;
            case SVDLIBC_SPARSE_BINARY:
                formatString = " -r sb ";
                break;
            case SVDLIBC_SPARSE_TEXT:
                break;
            default:
                throw new UnsupportedOperationException(
                    "Format type is not accepted");
            }

            File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
            outputMatrixFile.deleteOnExit();
            String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();

            LOG.fine("creating SVDLIBC factor matrices at: " + 
                              outputMatrixPrefix);
            String commandLine = "svd -o " + outputMatrixPrefix + formatString +
                " -w dt " + 
                " -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
            LOG.fine(commandLine);
            Process svdlibc = Runtime.getRuntime().exec(commandLine);
            BufferedReader stdout = new BufferedReader(
                new InputStreamReader(svdlibc.getInputStream()));
            BufferedReader stderr = new BufferedReader(
                new InputStreamReader(svdlibc.getErrorStream()));

            StringBuilder output = new StringBuilder("SVDLIBC output:\n");
            for (String line = null; (line = stderr.readLine()) != null; ) {
                output.append(line).append("\n");
            }
            LOG.fine(output.toString());
            
            int exitStatus = svdlibc.waitFor();
            LOG.fine("svdlibc exit status: " + exitStatus);

            if (exitStatus == 0) {
                File Ut = new File(outputMatrixPrefix + "-Ut");
                File S  = new File(outputMatrixPrefix + "-S");
                File Vt = new File(outputMatrixPrefix + "-Vt");
                U = MatrixIO.readMatrix(
                        Ut, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY, true); // matrix U
                scaledDataClasses = false;

                V = MatrixIO.readMatrix(
                        Vt, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY); // matrix V
                scaledClassFeatures = false;


                singularValues =  readSVDLIBCsingularVector(S, dimensions);
            } else {
                // stderr was already drained into 'output' before waitFor(),
                // so reuse it; reading the stream again would return nothing.
                LOG.warning("svdlibc exited with error status.  " +
                               "stderr:\n" + output);
            }
        } catch (IOException ioe) {
            LOG.log(Level.SEVERE, "SVDLIBC", ioe);
        } catch (InterruptedException ie) {
            LOG.log(Level.SEVERE, "SVDLIBC", ie);
        }
    }

    public MatrixBuilder getBuilder() {
        return new SvdlibcSparseBinaryMatrixBuilder();
    }

    // Reads the singular values written by SVDLIBC: the first line holds the
    // number of values, followed by one value per line.
    private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
                                                      int dimensions)
            throws IOException {
        // try-with-resources closes the reader when done
        try (BufferedReader br =
                 new BufferedReader(new FileReader(sigmaMatrixFile))) {
            double[] m = new double[dimensions];

            int readDimensions = Integer.parseInt(br.readLine());
            if (readDimensions != dimensions)
                throw new RuntimeException(
                        "SVDLIBC generated the incorrect number of " +
                        "dimensions: " + readDimensions + " versus " + dimensions);

            int i = 0;
            for (String line = null; (line = br.readLine()) != null; )
                m[i++] = Double.parseDouble(line);
            return m;
        }
    }

SVD in Java (S-Space library)

5. Aylien API

Aylien Text Analysis API is an API for analyzing and classifying text.

The service classifies text according to a taxonomy, such as IPTC or IAB-QAG.

The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia to define content categories on at least two levels, which makes content classification much more consistent. The first level is a broad category, and the second is a more detailed description within that root category (Figure 6).

To use this API, you need to obtain an application ID and a key on the official site. With these credentials, you can call the API's POST and GET methods from Java code.

// Replace the placeholders with your own application ID and key.
private static TextAPIClient client = new TextAPIClient("YOUR_APP_ID", "YOUR_APP_KEY");

You can then run the classification by passing in the data to be classified.

ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
  System.out.println(c);
}

The service returns the response in JSON format:

{
  "categories": [
    {
      "confident": true,
      "id": "IAB19-36",
      "label": "Windows",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
          "rel": "self"
        },
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "parent"
        }
      ],
      "score": 0.5675236066291172
    },
    {
      "confident": true,
      "id": "IAB19",
      "label": "Technology & Computing",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "self"
        }
      ],
      "score": 0.46704140928338533
    }
  ],
  "language": "en",
  "taxonomy": "iab-qag",
  "text": "When Microsoft announced its wrenching..."
}

This API is used to classify the clusters obtained with unsupervised clustering.
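One possible way to turn such a response into a single label for a cluster is to keep only the categories flagged as confident and take the one with the highest score. The sketch below is an assumption about how the response could be used, not the article's code: BestCategory and its nested Category class are hypothetical stand-ins for the entries of the "categories" array and do not belong to the Aylien SDK.

```java
import java.util.Arrays;
import java.util.List;

class BestCategory {

    /** Hypothetical stand-in for one entry of the "categories" array. */
    static final class Category {
        final String id;
        final String label;
        final double score;
        final boolean confident;

        Category(String id, String label, double score, boolean confident) {
            this.id = id;
            this.label = label;
            this.score = score;
            this.confident = confident;
        }
    }

    /** Returns the confident category with the highest score, or null if there is none. */
    static Category pick(List<Category> categories) {
        Category best = null;
        for (Category c : categories)
            if (c.confident && (best == null || c.score > best.score))
                best = c;
        return best;
    }

    public static void main(String[] args) {
        // Values copied from the sample response above.
        List<Category> categories = Arrays.asList(
            new Category("IAB19-36", "Windows", 0.5675236066291172, true),
            new Category("IAB19", "Technology & Computing", 0.46704140928338533, true));
        Category best = pick(categories);
        System.out.println(best.id + " " + best.label);   // IAB19-36 Windows
    }
}
```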

Afterword

The algorithms described above have alternatives and ready-made libraries; you just have to look for them. If you liked the article, or have ideas or questions, please leave a comment. The third part will be more abstract and will mainly discuss the system architecture: a description of the pipeline, what was used, and in what order.

It will also show the output of each algorithm after it is applied, as well as the final result of this work.
