How to count distinct patent applications?

Posted: Fri Nov 09, 2012 10:21 pm
by ndbac41
After years of working with the 2009 version of PATSTAT, I recently switched to the latest one. After reading the documentation of the new version, I am no longer sure how to count distinct patent applications, and I see I'm not the only one with this question. In previous versions, I relied on APPLN_ID to identify distinct patent applications (after filtering out the artificial patent applications created for missing priority and citation documents).

The warning on page 52 of the DATA CATALOG worries me ("Warning: Please consider that the application kind code landscape can be at times complicated, eg for German applications ...").

If I understand this correctly, can we no longer rely on APPLN_ID to uniquely identify patent applications? (Different APPLN_IDs could point to the same real-world patent application.)

Technically speaking, a new APPLN_ID is assigned in PATSTAT for every unique combination of patent authority, application number, and application kind. Now (again ignoring artificial patent applications), is it possible that the same real world patent application (and hence the same combination of application authority and number) is present in table APPLN with different application document kinds (and hence different APPLN_ID’s)? The warning about Germany suggests this is possible.

If we check for records in table APPLN with multiple document kinds for the same application authority/number combination, millions of records are found. I presumed these were cases where different types of patents accidentally have the same application number because the respective patent authority uses different numbering schemes/sequences for every type of patent (e.g. JP 2007001482: the same combination of application authority and number is present 3 times in table APPLN with 3 different application kinds: patent, design patent and PCT filing, hence 3 different APPLN_ID’s, makes sense).
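The check described above can be sketched as follows. This is only an illustration on an in-memory SQLite table: the table layout is reduced to the four columns discussed here, the APPLN_ID values and the US row are made up, and the kind codes "A", "F" and "W" for the JP example are assumed stand-ins for patent, design patent and PCT filing.

```python
import sqlite3

# Toy stand-in for the PATSTAT application table (only the columns discussed here).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE APPLN (
        APPLN_ID   INTEGER PRIMARY KEY,
        APPLN_AUTH TEXT,
        APPLN_NR   TEXT,
        APPLN_KIND TEXT
    )
""")
# APPLN_ID values are invented; the kind codes for JP are assumptions.
conn.executemany("INSERT INTO APPLN VALUES (?, ?, ?, ?)", [
    (1, "JP", "2007001482", "A"),
    (2, "JP", "2007001482", "F"),
    (3, "JP", "2007001482", "W"),
    (4, "EP", "91904833", "A"),
    (5, "EP", "91904833", "D"),
    (6, "US", "0000001", "A"),   # hypothetical row with a single kind
])

# Authority/number combinations that occur with more than one application kind.
multi_kind = conn.execute("""
    SELECT APPLN_AUTH, APPLN_NR, COUNT(DISTINCT APPLN_KIND) AS n_kinds
    FROM APPLN
    GROUP BY APPLN_AUTH, APPLN_NR
    HAVING COUNT(DISTINCT APPLN_KIND) > 1
    ORDER BY APPLN_AUTH, APPLN_NR
""").fetchall()

print(multi_kind)
```

On the real table the same GROUP BY / HAVING pattern returns the millions of combinations mentioned above; it cannot by itself tell whether the kinds refer to different real-world applications.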

If we cannot rely on this, i.e. if there are cases where the same authority/number combination with multiple application kinds does refer to the same real-world application, how can we identify distinct patent applications?

I'm also confused by the documentation on application document kinds (see again page 52: document kind D K L M N = dummy for de-duplicating). I looked up an example: EP 91904833, 2 different application document kinds (A and D), and hence 2 different APPLN_ID’s. When I look up this patent in Espacenet, it looks like both patents are identical, but with two different patent publication numbers, and the same application number, except for a “D” in the last position of the application number (is this duplication an EPO error?).

Does this mean that every time we find a document kind D, K, L, M or N in table APPLN, we can ignore that APPLN_ID and only count the application with the same authority/number and an application document kind other than D, K, L, M or N? (Only a limited number of cases are found for EPO and USPTO, but a large number for the German patent office.)

Is Germany an exception, and if so, how should we count distinct applications for the German patent office? (The documentation refers to the forum for a sample query for German applications, but I did not find it.)
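For illustration, the rule being asked about could be written as below. This is a sketch of the hypothesis only, not a confirmed counting method: it assumes every record with kind D, K, L, M or N is a duplicate of another record, which is exactly what the question leaves open. The DE and US rows and all APPLN_ID values are made up.

```python
import sqlite3

# Kind codes the documentation describes as dummies for de-duplicating.
DUMMY_KINDS = ("D", "K", "L", "M", "N")

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE APPLN (
        APPLN_ID   INTEGER PRIMARY KEY,
        APPLN_AUTH TEXT,
        APPLN_NR   TEXT,
        APPLN_KIND TEXT
    )
""")
conn.executemany("INSERT INTO APPLN VALUES (?, ?, ?, ?)", [
    (1, "EP", "91904833", "A"),
    (2, "EP", "91904833", "D"),
    (3, "DE", "0000001", "A"),   # hypothetical
    (4, "DE", "0000001", "K"),   # hypothetical
    (5, "US", "0000002", "A"),   # hypothetical
])

# Count every APPLN_ID, then count only those with a non-dummy kind.
naive_count = conn.execute("SELECT COUNT(*) FROM APPLN").fetchone()[0]
placeholders = ", ".join("?" for _ in DUMMY_KINDS)
filtered_count = conn.execute(
    f"SELECT COUNT(*) FROM APPLN WHERE APPLN_KIND NOT IN ({placeholders})",
    DUMMY_KINDS,
).fetchone()[0]

print(naive_count, filtered_count)
```

Note that such a filter would also drop a dummy-kind record that has no non-dummy counterpart, so it would undercount if such records exist.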

Tom Magerman

Re: How to count distinct patent applications?

Posted: Mon Mar 18, 2013 5:44 pm
by nico.rasters
I have the October 2011 version so my documentation is older. The changelog for April 2011:

Table TLS201_APPLN: New permanent unique application identifier introduced in APPLN_ID. With the April 2011 edition, the DOCDB "doc-id" unique and stable identifier has been used to populate APPLN_ID instead of creating a PATSTAT-edition-specific surrogate key (but not for the artificial applications in PATSTAT). DOCDB attribute "doc-id" contains a stable and unique identifier that will allow for linking up a number of EPO raw data products through the application in a reliable way. This attribute will remain the same across PATSTAT editions and will always refer to the same combination of application authority, application number and application kind.

It seems to suggest that the artificial applications do not have the same APPLN_ID.

Based on the following query I do find 3,514 results. Note that this is based on PATSTAT October 2011.

It looks like the culprit is kind code D2 (all the doubles have kind code "D2").
From the documentation:
For artificial applications which were created for all artificial publications which were themselves artificially created for those cited publications, where the cited publications are not registered in DOCDB as publications: use the kindcode "D2".

Re: How to count distinct patent applications?

Posted: Mon Mar 18, 2013 6:52 pm
by nico.rasters
I have analyzed the German double patents, and these are my conclusions:
  • The kind code is always D2
  • The filing date is always 9999-12-31
  • The IPR_TYPE is always "PI"
  • APPLN_ID ranges from 907000848 to 908546224; this range seems to be reserved especially for D2 applications.
In the documentation I found mention of special ranges (though the range mentioned here is different):
This paragraph applies strictly only to the October 2011 Edition of the statistical database: these artificial applications derived from priority documents which we cannot 100% match with DOCDB have an APPLN_ID in the range 900,000,001 to 906,561,807 . If you choose to use these surrogate key values in your analyses or programs, please be aware that these APPLN_ID values will always be different in future versions of the database. We use them because it keeps the database smaller, and can help to make the database perform faster.
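The two ID ranges quoted in this post can be turned into a small helper for flagging suspect records. This is a sketch only: the function name and labels are made up, and both ranges are reported for the October 2011 edition, so they should not be assumed to hold for any other edition.

```python
# ID ranges quoted in this thread, both for the October 2011 edition only.
PRIORITY_RANGE = (900_000_001, 906_561_807)  # from the documentation quote
D2_RANGE = (907_000_848, 908_546_224)        # observed for the German D2 doubles


def classify_appln_id(appln_id: int) -> str:
    """Label an APPLN_ID by the quoted range it falls into (hypothetical helper)."""
    if PRIORITY_RANGE[0] <= appln_id <= PRIORITY_RANGE[1]:
        return "priority-derived artificial range"
    if D2_RANGE[0] <= appln_id <= D2_RANGE[1]:
        return "observed D2 range"
    return "outside both quoted ranges"


for example_id in (12_345, 900_000_001, 907_000_848):
    print(example_id, classify_appln_id(example_id))
```

Filtering on APPLN_KIND or on the documented artificial-application flags is likely more robust than hard-coding ID ranges, since the documentation itself warns that these values change between editions.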

Re: How to count distinct patent applications?

Posted: Mon Mar 18, 2013 10:35 pm
by ndbac41
Thank you for your efforts to answer my question, but I'm afraid you missed my point. I do understand the issue of artificial applications and hence the fact you might find duplicate records for the same APPLN_ID, hence those 3,514 duplicate results for DE, all having 'D2' as kind code. That can be solved (just filter out artificial applications for the purpose of counting patents).

The problem is that the documentation suggests that the same real-world patent can be present in PATSTAT table APPLN multiple times with DIFFERENT APPLN_IDs (nothing to do with artificial applications). This is because the same application might occur with multiple (hence different) APPLN_KIND values, and hence get a different APPLN_ID each time (PATSTAT issues a new APPLN_ID for every new combination of APPLN_AUTH, APPLN_NR and APPLN_KIND).

To reproduce the issue, just run the following query:

select APPLN_AUTH, APPLN_NR, count(APPLN_ID)
from APPLN
where APPLN_ID < 900000000
group by APPLN_AUTH, APPLN_NR
having count(APPLN_ID) > 1

The basic question is: how can a real world patent with a given combination of APPLN_AUTH and APPLN_NR have multiple records in table APPLN, with DIFFERENT APPLN_ID?

Well, for some patent offices (like JP), it might make sense because different types of patents get their own numbering scheme, so a utility patent and a design patent can have the same application number, although they are different. You can see from the APPLN_KIND that the one is a utility patent, and the other is a design patent. So you still can rely on APPLN_ID to distinguish patent applications.

But for the German patent office, this seems not to be the case. It seems that identical real-world patents are indeed present multiple times in table APPLN because of different APPLN_KIND values. But then, how do we count distinct applications? We cannot rely on the combination of APPLN_AUTH and APPLN_NR (that would, for example, heavily underestimate the number of patent applications at the Japanese patent office), but it seems we cannot rely on APPLN_ID either (that would overestimate the number of patent applications at the German patent office).
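A toy example makes the dilemma concrete. All rows below are invented, and the data encodes the assumption under discussion: the two JP records are different real-world applications, while the two DE records are the same one. Under that assumption the true count is 3, and neither counting key gives it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE APPLN (
        APPLN_ID   INTEGER PRIMARY KEY,
        APPLN_AUTH TEXT,
        APPLN_NR   TEXT,
        APPLN_KIND TEXT
    )
""")
conn.executemany("INSERT INTO APPLN VALUES (?, ?, ?, ?)", [
    (1, "JP", "0000001", "A"),   # hypothetical patent
    (2, "JP", "0000001", "F"),   # hypothetical design, a different application
    (3, "DE", "0000002", "A"),   # hypothetical application
    (4, "DE", "0000002", "K"),   # assumed to be the same application again
])

# Counting by APPLN_ID: one per row, so the assumed DE duplicate is counted twice.
count_by_id = conn.execute("SELECT COUNT(*) FROM APPLN").fetchone()[0]

# Counting by authority/number: the two distinct JP applications collapse into one.
count_by_auth_nr = conn.execute("""
    SELECT COUNT(*) FROM (SELECT DISTINCT APPLN_AUTH, APPLN_NR FROM APPLN)
""").fetchone()[0]

print(count_by_id, count_by_auth_nr)
```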

So again, is it right that German patent applications can be present in table APPLN multiple times with different APPLN_KIND (and hence APPLN_ID), and how can we uniquely identify patent applications?

Re: How to count distinct patent applications?

Posted: Tue Mar 19, 2013 11:50 pm
by nico.rasters
Ah, my apologies. I am still a bit confused though. I do not find duplicate records for the same APPLN_ID. That would not be possible, as APPLN_ID is a primary key and therefore unique. What I found were 3514 doubles where the APPLN_NR and APPLN_KIND were the same. Granted, it's a different case than the one you describe, as you mentioned "the same APPLN_NR, but different APPLN_KIND". I still have to look into that. But my findings give me the feeling that the German data suffers from entry errors. Perhaps they have been appending data instead of updating. While having the same APPLN_NR and a different APPLN_KIND in the database does not make much sense, it makes absolutely no sense to have the same APPLN_NR and the same APPLN_KIND.
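A check for this kind of double (same authority, number and kind, but different APPLN_ID) might look like the sketch below. It is not the query used for the 3,514 results, which is not shown in this thread; the rows and the application number are made up, with APPLN_ID values picked from the D2 range reported earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE APPLN (
        APPLN_ID   INTEGER PRIMARY KEY,
        APPLN_AUTH TEXT,
        APPLN_NR   TEXT,
        APPLN_KIND TEXT
    )
""")
conn.executemany("INSERT INTO APPLN VALUES (?, ?, ?, ?)", [
    (907000848, "DE", "0000001", "D2"),  # hypothetical double
    (907000849, "DE", "0000001", "D2"),  # hypothetical double
    (1, "DE", "0000002", "A"),           # hypothetical regular application
])

# Groups where authority, number AND kind are identical across several APPLN_IDs.
doubles = conn.execute("""
    SELECT APPLN_AUTH, APPLN_NR, APPLN_KIND, COUNT(*) AS n_records
    FROM APPLN
    GROUP BY APPLN_AUTH, APPLN_NR, APPLN_KIND
    HAVING COUNT(*) > 1
""").fetchall()

print(doubles)
```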