It follows that if it were possible to predict, in advance of filing, to which Art Unit a US patent application would likely be assigned, this could go some way towards predicting the prospects of success in examination.
As also mentioned in my earlier article, the USPTO Office of the Chief Economist has published the Patent Claims Research Dataset, comprising six data files containing individually-parsed claims, claim-level statistics, and document-level statistics for all US patents granted between 1976 and 2014 and US patent applications published between 2001 and 2014.
Having this large data set available caused me to wonder: are patent claims a good predictor of the Art Unit to which an application may be assigned? On the one hand, it seems logical that this might be the case. After all, Art Unit assignments are based on technology, and it is generally necessary to refer to technical features of the invention in the patent claims. On the other hand, claim language is often broader and more abstract than the specific field of technology to which the claimed invention may be directed, and the allocation to an Art Unit is, in practice, based upon an initial review of the patent specification as a whole, and not just the claims.
There is, however, only one way to find out whether there is a sufficiently strong correlation between claim language and Art Unit to enable prediction, and that is to conduct some experiments using the available data.
My initial results are promising. By employing some relatively straightforward text processing techniques, I have successfully predicted the Technology Centre to which an application is assigned in just over 70% of cases, the correct group of 10 Art Units in over 40% of cases, and the individual Art Unit in around 24% of cases. This is certainly sufficient to encourage me to persevere with some more sophisticated techniques, to see if it is possible to make further improvements.
Source Data
The Patent Claims Research Dataset (PCRD) contains – give or take the occasional processing error – the full text of every claim of 4,954,323 issued US patents. The earliest of these is US patent no. 3,930,271, which was issued on 6 January 1976, and the latest is US patent no. 8,925,110, issued on 30 December 2014. The PAIR (Patent Application Information Retrieval) Bulk Data (PBD) set comprises over 9.4 million records containing bibliographic details of patents and applications in the Public PAIR system. The USPTO claims that PBD is complete back to 1981, with some data dating back as far as 1935. Among the information in the PBD is the USPTO examination Art Unit to which each application was originally assigned.
It is thus possible to match the text of claims in the PCRD to the corresponding Art Units in which they were examined, as recorded in the PBD.
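In essence, this matching is a join on patent number. The following is a minimal sketch only, assuming both data sets have already been loaded into in-memory records. The field names (`pat_no`, `claim_txt`, `art_unit`) are hypothetical placeholders, not the actual column names of the PCRD or PBD files.

```python
from collections import defaultdict

def match_claims_to_art_units(claim_rows, pair_rows):
    """Join claim text to Art Units on patent number.

    claim_rows: iterable of dicts with hypothetical keys 'pat_no', 'claim_txt'
    pair_rows:  iterable of dicts with hypothetical keys 'pat_no', 'art_unit'
    Returns {pat_no: (full_claim_text, art_unit)} for patents found in both.
    """
    # Gather every claim of a patent into one list, in input order.
    claims = defaultdict(list)
    for row in claim_rows:
        claims[row["pat_no"]].append(row["claim_txt"])

    art_units = {row["pat_no"]: row["art_unit"] for row in pair_rows}

    # Keep only patents that appear in both data sets.
    return {
        pat_no: (" ".join(texts), art_units[pat_no])
        for pat_no, texts in claims.items()
        if pat_no in art_units
    }
```

Patents with claims but no matching bibliographic record are silently dropped here; a real pipeline would probably want to count and inspect them.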
‘Similarity’ Algorithm
To investigate the hypothesis that claim language correlates with the assigned Art Unit, and thus has predictive capacity, I built a model based on the following algorithm.
- A list of roughly 110,000 patent numbers (110,088, to be exact) was selected at random from the PCRD. It is convenient to use machine learning nomenclature, and call this the training set, although in the current implementation nothing is really being ‘trained’.
- The complete claim set of each patent in the training set was then retrieved.
- For each set of claims, all numeric characters and punctuation were removed.
- All of the remaining terms were reduced to their ‘stem’ using Martin Porter’s Snowball English stemming algorithm. (‘Stemming’ is the process of converting related terms, such as ‘comprise’, ‘comprises’, and ‘comprising’ to a common base or root form, such as ‘compris’.)
- All terms of three characters or fewer were eliminated from the reduced word list, along with a set of about 100 extremely common stemmed terms (including such patent claim classics as ‘compris’, ‘includ’, ‘apparatus’, ‘process’ and ‘arrang’) that were identified in a first pass through the claims. These terms appear in so many claim sets (I set the threshold at 12.5% of all claim sets) that they have no capacity to distinguish patents in different fields of technology.
- For each patent in the training set, a list was generated of the remaining terms appearing in its claims, and the corresponding frequency (i.e. the number of times the term appears).
- For every term identified across all of the patents in the training set, a corresponding ‘inverse document frequency’ (IDF) was computed. The basic idea underlying IDF is that a term that is common to only a few documents (i.e. claim sets, in my case) is more likely to be indicative of meaningful similarity between those documents than a word that is common to a larger number of documents. Thus the inverse of a measure of ‘document frequency’ of a term is a useful measure of the relative importance of that term in grouping related documents.
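The text-processing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code actually used: `crude_stem` is a toy stand-in for the Snowball stemmer (a real stemmer can be passed in through the `stem` parameter), the function names are hypothetical, and the IDF formula log(N/df) is just one common formulation.

```python
import math
import re
from collections import Counter

def crude_stem(word):
    """Toy stand-in for a Snowball stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def term_counts(claim_text, stop_terms=frozenset(), stem=crude_stem):
    """Count stemmed terms, dropping short terms and stop terms.

    Splitting on anything that is not a letter discards digits and
    punctuation in the same step.
    """
    words = re.findall(r"[a-z]+", claim_text.lower())
    stems = (stem(w) for w in words)
    return Counter(s for s in stems if len(s) > 3 and s not in stop_terms)

def common_terms(all_counts, threshold=0.125):
    """Terms appearing in more than `threshold` of all claim sets."""
    doc_freq = Counter(t for counts in all_counts for t in counts)
    return {t for t, df in doc_freq.items() if df / len(all_counts) > threshold}

def inverse_document_frequency(all_counts):
    """IDF as log(N / df): assumed formulation, other variants exist."""
    n_docs = len(all_counts)
    doc_freq = Counter(t for counts in all_counts for t in counts)
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}
```

A two-pass use would call `term_counts` on every claim set with no stop terms, derive the stop list with `common_terms`, and then recount with that list before computing the IDF values.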
With this data, it is possible to compute a measure of ‘distance’ between any two sets of patent claims (specifically the ‘cosine similarity’ of their weighted term vectors, where a higher similarity corresponds to a smaller distance). In particular, by selecting at random a patent not in the training set (i.e. a patent from what would be known in machine learning parlance as a test set or cross-validation set), it is possible to determine a group of patents within the training set that are the ‘closest’ to it according to this distance measure.
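For concreteness, here is a minimal sketch of cosine similarity between two claim sets, assuming each is represented as a sparse dictionary mapping terms to weights (term frequency multiplied by IDF). The function names and the dictionary representation are assumptions made for illustration.

```python
import math

def tfidf_vector(counts, idf):
    """Weight raw term counts by IDF; terms unseen in training are ignored."""
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (identical direction), and the corresponding cosine distance is simply one minus this value.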
A k-nearest neighbours (k-NN) algorithm was used (for a range of values of k) in order to predict the Art Unit to which each randomly-selected patent would have been assigned. The k-NN procedure identifies the k ‘closest’ patents in the training set according to the distance measure, each of which has an associated Art Unit. Typically, these Art Unit assignments are not the same, even though the k patents are themselves relatively similar. A ‘voting’ system was therefore applied in order to rank the Art Units assigned to the k closest patents. Not all neighbours received an equal vote – greater weight was given to closer neighbours than to more distant ones.
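The weighted vote might look something like the sketch below. The exact weighting scheme is not spelled out above, so this example simply assumes that each neighbour's vote equals its similarity to the test patent; `knn_vote` is a hypothetical name.

```python
import heapq
from collections import defaultdict

def knn_vote(similarities, labels, k):
    """Rank labels by similarity-weighted votes from the k nearest neighbours.

    similarities: similarity of the test patent to each training patent
    labels:       Art Unit of each training patent, in the same order
    Returns [(label, score), ...] sorted from highest to lowest score.
    """
    nearest = heapq.nlargest(k, zip(similarities, labels), key=lambda p: p[0])
    scores = defaultdict(float)
    for sim, label in nearest:
        scores[label] += sim  # assumed weighting: vote equals the similarity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that two moderately close neighbours sharing an Art Unit can outvote a single closer neighbour, which is the intended effect of pooling votes.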
Three-Level Predictions
The USPTO’s Art Units are organised in a hierarchy. Each Art Unit (AU) is identified by a four-digit number, of which the first two digits identify a Technology Centre (TC). The AUs within a TC tend to cover broadly related subject matter, although this is not universally true. For example, TC 3600 covers such diverse subject matter as surface transportation (3610-3619), e-commerce (3620-3629) and static structures, supports and furniture (3630-3639). However, AUs within a ‘group of 10’ (i.e. those sharing the same first three digits) typically encompass very closely-related technologies.
Accordingly, while the ultimate goal is to correctly predict the precise Art Unit to which a patent application will be assigned, there is considerable value in identifying at least the correct ‘group of 10’ AUs, and some value in identifying the correct TC. The weighted voting system was therefore applied at all three of these levels.
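Voting at three levels can be sketched by tallying each neighbour's weight under three different keys derived from its Art Unit number. This is an assumed implementation for illustration; the function names are hypothetical.

```python
from collections import defaultdict

def levels(art_unit):
    """Split a four-digit Art Unit into (TC, group of 10, AU) keys."""
    au = str(art_unit)
    return au[:2] + "00", au[:3] + "0", au

def vote_at_three_levels(neighbours):
    """neighbours: [(weight, art_unit), ...] for the k nearest patents.

    Returns the top-scoring TC, group of 10 and AU, each voted on separately.
    """
    tallies = [defaultdict(float) for _ in range(3)]
    for weight, art_unit in neighbours:
        for tally, key in zip(tallies, levels(art_unit)):
            tally[key] += weight
    return tuple(max(t, key=t.get) for t in tallies)
```

Because the three votes are independent, the winning AU need not fall within the winning TC: three neighbours spread across 3621, 3624 and 3627 can carry TC 3600 while a single strong neighbour in 2121 wins at the AU level.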
Of course, in this case the Art Unit to which the test-set patents were in fact assigned is known, and it is therefore possible to compare predicted AUs and TCs with actual AUs and TCs to assess the success rate of the algorithm.
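The comparison amounts to counting matches at each level. A minimal sketch, assuming each prediction is a (TC, group of 10, AU) triple of strings and each actual assignment is a four-digit AU string:

```python
def levels(art_unit):
    """Split a four-digit Art Unit into (TC, group of 10, AU) keys."""
    au = str(art_unit)
    return au[:2] + "00", au[:3] + "0", au

def accuracy_by_level(predictions, actual_aus):
    """Fraction of test patents predicted correctly at each level.

    predictions: [(tc, group, au), ...]; actual_aus: true AUs, same order.
    """
    if not actual_aus or len(predictions) != len(actual_aus):
        raise ValueError("need equal-length, non-empty inputs")
    hits = [0, 0, 0]
    for predicted, actual in zip(predictions, actual_aus):
        for i, (p, a) in enumerate(zip(predicted, levels(actual))):
            hits[i] += p == a  # True counts as 1
    n = len(actual_aus)
    return {"tc": hits[0] / n, "group": hits[1] / n, "au": hits[2] / n}
```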
Effect of ‘Restructuring’ at the USPTO
Before presenting the results, I need to mention one final complication. Not surprisingly, given new developments in technology and management demands, the USPTO’s Technology Centres are restructured from time to time. I therefore found that the AU assignments of older patents are not a good predictor of the assignments of more recent patents. Thus far, I have been unable to find any information on historical changes that would enable me to map ‘old’ to ‘new’ AU assignments. For the present experiment I have therefore mitigated this issue by restricting my training and test sets to patents with numbers above 6,500,000 (issued on or after 31 December 2002). This reduced the effect of restructuring, but also reduced the size of the training set by around 50%, to 56,564 patents.
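The restriction itself is a simple filter on patent number. The sketch below combines it with a random hold-out split purely for illustration. The selection procedure actually used is described only in outline above, so the function name, the seed, and the way the test set is drawn are all assumptions.

```python
import random

MIN_PATENT_NUMBER = 6_500_000  # threshold stated in the text

def split_recent(patent_numbers, n_test, seed=0):
    """Keep patents numbered above the threshold, then hold out a test set.

    Returns (training, test) lists with no patent in common.
    """
    recent = sorted(n for n in patent_numbers if n > MIN_PATENT_NUMBER)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(recent)
    return recent[n_test:], recent[:n_test]
```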
Testing and Results
Predicted TCs, ‘group of 10’ AUs, and specific AUs were determined for a test set of 1000 randomly-selected patents from outside the training set, and compared with the corresponding actual assignments. A range of values of k was trialled, to see whether the number of ‘nearest neighbours’ participating in the voting process has a significant effect on performance. Very similar results were obtained for all values of k in excess of five, with just minor improvements for increasing k. This probably reflects the fact that more distant neighbours, whose votes receive a lower weighting, have a decreasing effect on the scores assigned to each associated AU, ‘group of 10’ and/or TC.
Overall, the best result was achieved for k = 45, for which the model correctly predicted the specific Art Unit in 23.8% of cases. It predicted an Art Unit within the correct ‘group of 10’ in 40.5% of cases, and the correct Technology Centre in 71.6% of cases.
Conclusion – Next Steps
While the results so far are not spectacular, they are clearly encouraging. There are a number of reasons to believe that a more sophisticated model could achieve significant improvements. For example:
- while this simple model predicted the correct Art Unit in only 24% of cases, the ‘correct’ result was present among the k nearest neighbours far more frequently than this, suggesting that the initial distance measure may not be optimal;
- the ‘predictive power’ of a term within the current model is determined solely by its IDF weighting, although there is no reason to believe that two terms relating to different fields of technology should have the same significance merely because they occur in the same number of documents in the training set;
- the current model produces specific scores for a set of candidate AUs, but then throws away all of this information other than the identity of the ‘winner’ – there is therefore useful information (e.g. about the confidence of a prediction, and likely alternatives) that is currently discarded; and
- there are many alternative distance measures and ‘vote’ weightings that could be employed, and it is quite plausible that some of these could produce better results.
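As one illustration of how the discarded score information might be retained, the sketch below returns the top few candidate AUs together with each one's share of the total vote. This is a hypothetical extension, not part of the current model, and a vote share is not a calibrated probability.

```python
def ranked_candidates(scores, top_n=3):
    """Turn raw vote scores into the top_n candidates with normalised shares.

    scores: {art_unit: summed vote weight} from the k-NN voting step
    Returns [(art_unit, share_of_total_vote), ...], best candidate first.
    """
    total = sum(scores.values())
    if total <= 0:
        return []  # no usable votes, so no candidates
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(au, score / total) for au, score in ranked[:top_n]]
```

A prediction where the winner takes most of the vote could then be treated differently from one where several AUs are nearly tied.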