30 August 2020

A Recurrent Neural Network for Classifying Patent Application Technology based on Titles

In a companion article, I presented the results of using a machine learning model to classify Australian provisional applications into 35 fields of technology based upon nothing but their titles.  In this article, I provide additional technical detail of the model, along with results of its performance in testing and validation.  I also make some observations on the costs of machine learning, in terms of hardware, computation, and energy consumption.  Even for a relatively modest model, these costs may become non-negligible, while recent reports indicate that large-scale state-of-the-art machine learning systems are most likely costing millions of dollars in compute resources and energy to develop.

It is not obvious that a neural network model could be trained to predict the technical field of a patent application given nothing but the title as input.  Human specialists (i.e. patent searchers and examiners) classify applications into very specific technical categories defined by various patent classification systems, such as the International Patent Classification (IPC), or the Cooperative Patent Classification (CPC) which has been jointly developed by the US and European patent offices.  In doing so, the specialists have access to the full patent specification and claims to enable them to determine the subject matter of the invention.

However, while accurate classification at the specificity of systems such as the IPC and CPC based only upon a title would doubtless be impossible – even for a human expert – a less challenging task, such as predicting a field of technology selected from a relatively small number of choices, may be feasible.

Here, I report results of training a neural network model on the task of classifying patent applications according to 35 technical fields grouped into five technology sectors.  The model achieves 67% accuracy, averaged across all technical fields, and nearly 80% accuracy in the best case (‘organic fine chemistry’), if forced to classify each title into a single field of technology.  However, not all misclassifications are necessarily ‘wrong’, given that the subject matter of a single patent application may cross multiple fields of technology.  At the higher level of ‘technology sector’, the model’s accuracy varies between 73% and 91%.  Furthermore, when the model output is used to identify multiple potential fields, the ‘correct’ classification appears in the top four predictions in over 89% of cases.

Overview of Model Design

The particular machine learning model I settled on to classify applications from their titles is based on a type of neural network known as a Gated Recurrent Unit (GRU).  A GRU is itself a type of recurrent neural network (RNN).  The defining characteristics of an RNN are that it contains ‘hidden’ state information (basically a form of ‘memory’), and that its outputs in response to one input are ‘fed-back’ to be combined with each subsequent input.  This means that RNNs are able to process variable-length sequences of inputs, with the output updating as each new input in the sequence is applied.  A GRU differs from more basic forms of RNNs in that it includes elements that enable it to control – and, indeed, to learn – what to ‘remember’ from earlier inputs as it processes a sequence.  GRUs are, in turn, simplified variations on more sophisticated structures known as Long Short-Term Memory (LSTM) networks.

My GRU model implementation receives inputs as a sequence of characters.  In its untrained form, therefore, it knows nothing about such concepts as ‘words’.  Using character-based input has the advantage of limiting the number of different input values that need to be encoded – the letters of the alphabet, the digits, and a few common punctuation symbols are sufficient.  A word-based implementation would require building a vocabulary containing all the words needed to train and use the model, and there would be no way to add new words once the model had been trained.  The main disadvantage of a character-based implementation is that the model has to learn everything that it needs to know about the structure of individual words, and how they are sequenced into sentences (or, in this case, application titles), from the patterns that it finds in its training data.  As a result, training may be slow, and may require a very large number of examples from which the model can learn.
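To make this concrete, here is a minimal sketch of a character-level encoder.  The original post does not include source code, so the character set, reserved indices, and function names below are illustrative assumptions rather than the model’s actual implementation (the code examples in this article assume Python, with PyTorch used later for the network itself).

```python
# Minimal sketch of a character-level encoder.  The character set and the
# reserved indices are assumptions for illustration, not the trained
# model's actual vocabulary.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789 ,.-/()'"
PAD_INDEX = 0      # reserved for padding short titles in a batch
UNKNOWN_INDEX = 1  # reserved for any character not in CHARSET
CHAR_TO_INDEX = {c: i + 2 for i, c in enumerate(CHARSET)}

def encode_title(title: str) -> list[int]:
    """Map a patent title to a sequence of integer character indices."""
    return [CHAR_TO_INDEX.get(c, UNKNOWN_INDEX) for c in title.lower()]

print(encode_title("Method for classifying patents"))
```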

In developing the model I searched through a number of different designs, and ranges of parameters, in order to find a configuration that was ‘optimum’ within my design constraints.  The final model is bidirectional (i.e. it processes each title both forwards and backwards).  It uses an input ‘embedding layer’ to enable it to learn an encoding of the character set, which I found to be more stable, and to produce better results, than a simple manual encoding.  In total, it has 13,710,695 trainable parameters – these are effectively the ‘strengths’ of connections between ‘neurons’ in the network that must be learned through the training process.
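For readers curious about what such a network looks like in code, the following is a rough sketch of a comparable architecture.  The post does not name the framework or publish the model definition, so this assumes PyTorch, and the layer sizes are illustrative choices rather than the values that produce the 13,710,695-parameter model described above.

```python
import torch
import torch.nn as nn

class TitleClassifier(nn.Module):
    """Character-level bidirectional GRU classifier (illustrative sketch;
    layer sizes are assumptions, not the original model's values)."""

    def __init__(self, vocab_size=64, embed_dim=64, hidden_dim=512, num_classes=35):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, char_indices):
        # char_indices: (batch, sequence_length) tensor of character indices
        embedded = self.embedding(char_indices)
        _, hidden = self.gru(embedded)                     # hidden: (2, batch, hidden_dim)
        hidden = torch.cat([hidden[0], hidden[1]], dim=1)  # concatenate both directions
        return self.classifier(hidden)                     # logits over the 35 fields
```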

Training Data

The dataset used to train the model was drawn from Australian patent applications filed (or entering the national phase, in the case of PCT applications) since 1 January 2002.  The majority of such applications have one or more IPC codes that have been assigned by a patent searcher or examiner in the Australian Patent Office, or in another patent office acting as the International Searching Authority in respect of a PCT application.  I used only the primary IPC code assigned to each application, and mapped it to a corresponding technology field listed in the ‘IPC concordance table’ developed by the World Intellectual Property Organization (WIPO).
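The mapping step amounts to a table lookup.  The sketch below shows the idea; the dictionary entries are purely illustrative placeholders (keyed at IPC subclass level for simplicity), not a faithful excerpt of WIPO’s concordance table, which is more granular in places.

```python
# Hypothetical lookup from primary IPC subclass to WIPO technology field
# number (see the table below).  The entries shown are illustrative only.
IPC_CONCORDANCE = {
    "A61K": 16,  # Pharmaceuticals
    "G06F": 6,   # Computer technology
    "H04L": 4,   # Digital communication
}

def field_for_ipc(primary_ipc_code: str):
    """Return the technology field number for a primary IPC code, if known."""
    subclass = primary_ipc_code[:4]  # e.g. 'A61K 31/00' -> 'A61K'
    return IPC_CONCORDANCE.get(subclass)
```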

The final dataset comprises 499,505 records, each consisting of a title and a technology field number from the following table.

No. | Sector | Field
1 | Electrical engineering | Electrical machinery, apparatus, energy
2 | Electrical engineering | Audio-visual technology
3 | Electrical engineering | Telecommunications
4 | Electrical engineering | Digital communication
5 | Electrical engineering | Basic communication processes
6 | Electrical engineering | Computer technology
7 | Electrical engineering | IT methods for management
8 | Electrical engineering | Semiconductors
9 | Instruments | Optics
10 | Instruments | Measurement
11 | Instruments | Analysis of biological materials
12 | Instruments | Control
13 | Instruments | Medical technology
14 | Chemistry | Organic fine chemistry
15 | Chemistry | Biotechnology
16 | Chemistry | Pharmaceuticals
17 | Chemistry | Macromolecular chemistry, polymers
18 | Chemistry | Food chemistry
19 | Chemistry | Basic materials chemistry
20 | Chemistry | Materials, metallurgy
21 | Chemistry | Surface technology, coating
22 | Chemistry | Micro-structural and nano-technology
23 | Chemistry | Chemical engineering
24 | Chemistry | Environmental technology
25 | Mechanical engineering | Handling
26 | Mechanical engineering | Machine tools
27 | Mechanical engineering | Engines, pumps, turbines
28 | Mechanical engineering | Textile and paper machines
29 | Mechanical engineering | Other special machines
30 | Mechanical engineering | Thermal processes and apparatus
31 | Mechanical engineering | Mechanical elements
32 | Mechanical engineering | Transport
33 | Other fields | Furniture, games
34 | Other fields | Other consumer goods
35 | Other fields | Civil engineering

For training purposes, I randomly shuffled the dataset, and split it into three parts:

  1. a training set (80% of the total), used in the actual training process for each candidate model design/configuration;
  2. a validation set (10% of the total), which is used to check the performance of each trained model on unseen data, to confirm that it has in fact ‘learned’ to generalise, and has not merely ‘memorised’ the training set (a potential problem known as ‘overfitting’); and
  3. a test set (10% of the total), which is not seen during the training and model comparison/optimisation process, and which is used to confirm the performance of the ‘best’ model to ensure that it is not merely the best at predicting technology classification of titles in the training and validation sets.  (A minimal sketch of this shuffle-and-split process appears below.)
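As a minimal sketch (assuming the title/field records are held in a Python list; the original tooling is not described in the post), the shuffle and split might look like this:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records and split them into training, validation and test sets."""
    shuffled = records[:]                  # copy, so the original order is preserved
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder (approximately 10%)
    return train, validation, test
```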

Model Optimisation and Performance

To select the final model design, I trained and tested 16 different GRU model configurations, and then interpolated between the four models that performed best on the validation set to determine parameters of a final candidate configuration.  Training and validating this final candidate confirmed that it indeed performed better than any of the original top four on the validation set.  Finally, I tested the final candidate model on the test set (which had not been used up to this stage) and confirmed that it performed as well on the test set as it had on the validation set (indeed, by chance, slightly better).

The chart below is a visual representation, known as a ‘confusion matrix’, of the performance of the final model on the test set of 49,947 title/field pairs.  The vertical axis represents the actual technology field associated with each title (i.e. the ‘true class’), according to the data in the test set, while the horizontal axis represents the technology field predicted by the GRU model, using only the title as input (i.e. the ‘predicted class’).  The ‘number’ (as indicated by the shading level) in each square of the matrix represents the fraction of titles in the corresponding ‘true class’ (i.e. matrix row) that were allocated by the model to the corresponding ‘predicted class’ (i.e. matrix column).  The values across each row thus sum to 1.0.  If the model were ‘perfect’, i.e. if the predicted class always matched the true class, then the only non-zero squares would be the ones on the diagonal, all of which would be equal to 1.0.

Confusion matrix, by field of technology

The clear ‘diagonalisation’ of the confusion matrix indicates that the model is quite successful in correctly classifying titles from the test set.  The model’s strongest performance is in class 14 (‘organic fine chemistry’), in which it achieves an accuracy of 79%.
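For anyone wanting to reproduce this kind of chart, a row-normalised confusion matrix can be computed as follows.  This is a sketch assuming NumPy arrays of true and predicted class indices, not the plotting code actually used for the figures in this article.

```python
import numpy as np

def row_normalised_confusion(true_classes, predicted_classes, num_classes=35):
    """Count (true, predicted) pairs, then normalise each row to sum to 1.0."""
    counts = np.zeros((num_classes, num_classes))
    for t, p in zip(true_classes, predicted_classes):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_totals, 1)  # guard against empty rows
```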

It should be noted, however, that the confusion matrix is actually a fairly harsh and unforgiving assessment of model performance in this case, since it forces the selection of a single predicted class for each title input (i.e. a ‘hard decision’).  In fact, the model has 35 outputs, each of which provides the model’s estimate of the ‘probability’ that the title is associated with a corresponding one of the available fields of technology.  To generate the confusion matrix, only the largest output value is retained, which fails to account for the reality that titles can be genuinely ambiguous, and also that many applications span multiple fields of technology.  Thus, when the ‘top’ predicted class does not match the true class, it is not necessarily the case that the model is completely wrong – it may have identified another relevant class that does not happen to correspond with the primary IPC code selected by the original human searcher or examiner.  An example I gave in my previous article was the (fictional) title ‘a computer-implemented method for determining oxygen content of a blood sample’.  When this is input to the model, the top four outputs are: analysis of biological materials (48%); computer technology (38%); medical technology (9.5%); and measurement (4.1%).  All of these are quite reasonable classifications given this title.
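A sketch of how those per-field ‘probabilities’ are produced for a single title is shown below, reusing the illustrative encode_title() function and TitleClassifier model from earlier (again, assuming PyTorch; a softmax converts the 35 raw outputs into values that sum to 1.0).

```python
import torch
import torch.nn.functional as F

def top_fields(model, title, k=4):
    """Return the model's k most probable technology fields for one title."""
    model.eval()
    with torch.no_grad():
        char_indices = torch.tensor([encode_title(title)])        # a batch of one title
        probabilities = F.softmax(model(char_indices), dim=1)[0]  # 35 'probabilities'
        values, fields = probabilities.topk(k)
    # Convert to 1-based field numbers matching the table above.
    return [(int(f) + 1, float(v)) for f, v in zip(fields, values)]
```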

If the model is indeed ‘cross-classifying’, rather than simply ‘misclassifying’ – at least some of the time – then we might expect to find a relatively high incidence of cross-classification into related fields of technology, i.e. those within the same sector (although, as the example above demonstrates, even cross-sector classifications may be quite valid).  The confusion matrix shown below is the result of aggregating all of the individual technology classifications into their corresponding sectors.  As expected, at this higher level the model is even more successful in making predictions that match the true classifications.  In the ‘chemistry’ sector, in particular, the ‘predicted’ sector matches the ‘true’ sector in over 90% of cases.

Confusion matrix, by sector

Another indication of cross-classification might be that where the maximal output of the model fails to match the ‘true’ class, as assigned by the original human searcher or examiner, this class nonetheless appears among the top few output values.  I tested this on the test set, and found that while the largest output of the model matches the ‘true’ technology field 66.6% of the time, the ‘true’ field is in the top two outputs 80.2% of the time, in the top three 86.0% of the time, and in the top four 89.3% of the time.
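Those top-k figures can be computed along the following lines.  This is a sketch assuming an array of per-title model outputs and an array of true field indices; the variable names are illustrative.

```python
import numpy as np

def top_k_accuracy(output_probs, true_classes, k):
    """Fraction of titles whose true class is among the k largest model outputs.

    output_probs: (num_titles, 35) array of model outputs.
    true_classes: (num_titles,) array of true field indices (0-based).
    """
    top_k = np.argsort(output_probs, axis=1)[:, -k:]  # indices of the k largest outputs
    hits = [true_classes[i] in top_k[i] for i in range(len(true_classes))]
    return float(np.mean(hits))
```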

When classifying provisional applications using the model, I used all of the outputs to accumulate ‘soft’ decisions across multiple fields of technology, rather than forcing a single ‘hard’ decision (as used for the confusion matrices).  Based on the test results noted above, it is reasonable to assume that under this approach, the overwhelming majority of applications will contribute at least partly to the technology field that would have been selected by a human searcher or examiner, and that the other fields to which each contributes will also commonly be relevant to the subject matter of the application.
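The difference between ‘soft’ and ‘hard’ accumulation can be sketched as follows (assuming, as before, an array of per-application probability vectors over the 35 fields; this illustrates the approach rather than reproducing the actual analysis code).

```python
import numpy as np

def soft_field_counts(per_application_probs):
    """Sum probability vectors so each application spreads its single 'vote'
    across all of the fields the model considers plausible."""
    return per_application_probs.sum(axis=0)           # shape (35,): fractional counts

def hard_field_counts(per_application_probs, num_fields=35):
    """For comparison: give one whole count to the single most probable field."""
    winners = per_application_probs.argmax(axis=1)
    return np.bincount(winners, minlength=num_fields)  # shape (35,): integer counts
```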

Some Notes on the Costs of Machine Learning

A neural network with 13,710,695 trainable parameters that includes a single-layer bidirectional GRU is not at all a large model by current standards.  On the other hand, it is not trivial to train, particularly given the recurrent structure of the GRU, which cycles through a sequence of individual character inputs for each title input, with millions of multiply-add operations being performed in each cycle.  It is not remotely feasible to train this model without using a GPU (graphics processing unit) to perform thousands of computations in parallel.

I have two machines that I used during development of the model.  My laptop is an Alienware m15, which comprises an Intel Core i7-8750H CPU (6 cores, maximum clock speed 4.1GHz), 16GB RAM, and an NVIDIA GeForce GTX 1060 GPU (6GB VRAM, 1280 compute cores), running Windows 10.  This is a mid-range ‘gaming’ laptop that I bought not for gaming, but because its GPU provides a reasonable compute capability for machine learning tasks in a relatively compact and portable package (although the portability is somewhat compromised by the hefty power supply unit that is required to maintain all of that number-crunching grunt at maximum capacity for anything more than an hour or so).  The primary machine used for training and optimisation is a Linux workstation that I had custom-built about three years ago, which comprises an Intel Core i7-7700K CPU (4 cores, maximum clock speed 4.5GHz) and 64GB RAM, to which I have since added an NVIDIA GeForce RTX 2070 GPU (8GB VRAM, 2304 compute cores).

Using the GPUs for computation, the final model takes 115 minutes (i.e. nearly two hours) to train on the workstation, and 312 minutes (i.e. just over five hours) to train on the laptop.  The difference is due to the greater computing power of the workstation GPU – it has 1.8 times as many compute cores, greater memory bandwidth (although this is probably not the bottleneck in this application), and is able to run consistently at its maximum clock rate, whereas the laptop is forced to throttle back the GPU (and CPU) clock to prevent overheating within its less well-ventilated chassis!  Here is an example of the output of the GPU monitoring program nvtop during training of the model on the workstation:

GPU performance monitor display

You can see (in the second line of the display) that the GPU was running at a temperature of 71°C and consuming nearly 230W of power.  Keep in mind that it was doing this for 115 minutes continuously to train the model, and that this was just one of seventeen variants trained during the optimisation process.  This does not include the various smaller trials and experiments using different model designs that I conducted prior to settling on the basic design that I then set about optimising.  Nor does it include test runs and failures during development and debugging of the code.  I am pretty sure that over the past couple of weeks I have consumed more electricity on computing than on lighting and most household appliances!

As I noted above, my GRU classifier is not at all a large model by current machine learning standards.  However, the cost – in terms of time, hardware, energy, and environmental impact – is something that even those of us merely dabbling in the technology need to keep in mind.  At the other end of the scale, estimates of the cost of training OpenAI’s latest ‘GPT-3’ natural language model – which comprises up to 175 billion trainable parameters – have ranged from US$4.6 million to US$12 million!

Going back to my comment about it not being feasible to train my GRU classifier without a GPU, I have not actually attempted to train it on CPU alone – and with very good reason.  I ran some small trials (just one tenth of one percent of the full training process) to estimate how long it would take, and calculated 154 hours (around six and a half days) on the workstation, and 285 hours (nearly 12 days) on the laptop!  (These numbers are interesting: on CPU specifications alone, the i7-8750H CPU in the laptop, with its two additional cores, should be faster than the i7-7700K CPU in the workstation.  But the workstation has faster memory, and the laptop is constrained by power and heat dissipation limits that prevent it from running at its maximum clock rates for extended periods.)

Using GPU computing is vastly more energy-efficient (and environmentally-friendly) than trying to run these types of models on CPUs, anyway.  Intel’s specifications for the i7-7700K CPU suggest that it probably dissipates about 100W under full load at maximum clock rate.  Training my GRU model on CPU would thus consume about 35 times as much energy as training on the RTX 2070 GPU, in addition to taking about 80 times as long!
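The arithmetic behind those ratios is straightforward, using the rough figures quoted above (about 230W for 115 minutes on the GPU, versus an estimated 100W for 154 hours on the CPU):

```python
# Rough energy comparison for training the final model (figures from the text above).
gpu_hours, gpu_watts = 115 / 60, 230   # ~1.9 hours at ~230 W (RTX 2070)
cpu_hours, cpu_watts = 154, 100        # ~154 hours at ~100 W (i7-7700K, estimated)

gpu_energy_kwh = gpu_hours * gpu_watts / 1000   # ~0.44 kWh
cpu_energy_kwh = cpu_hours * cpu_watts / 1000   # ~15.4 kWh

print(f"Energy ratio: {cpu_energy_kwh / gpu_energy_kwh:.0f}x")  # ~35x
print(f"Time ratio:   {cpu_hours / gpu_hours:.0f}x")            # ~80x
```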

Conclusion – Titles not so Obscure or Misleading, After All

Clearly, classifying a patent application into one or more fields of technology based purely upon its title is an inexact science.  Indeed, it is a task that even a technically-qualified human being could not perform with complete accuracy and consistency.  Classification of patent applications is performed by human searchers and examiners with the benefit of access to the full specification and claims.

I therefore find it quite remarkable that a machine learning model can be trained to correctly (compared with human classification) place applications into a single one of 35 technology fields in two-thirds of cases, and into a single broader sector in over three-quarters of cases.  Furthermore, the model is able to cross-classify many applications into multiple relevant fields, and identifies the human classification within its top four selections in around 90% of all cases.

And one final observation: it is often asserted that applicants (or, more commonly, their patent attorneys) devise generic or misleading titles in order intentionally to obscure the subject matter of the patent application.  The fact that a machine learning model can be successfully trained to achieve high accuracy on this classification task demonstrates that this is untrue in the vast majority of cases.
