Post by Darwin » Mon May 25, 2020 12:15 am

I am trying to measure a country's innovative ability by counting the applications filed by applicants resident in that country. However, I am unsure how to identify the first application of each DOCDB_FAMILY_ID.

At first, I wanted to use the earliest_filing_id of each DOCDB_FAMILY_ID. However, some earliest_filing_id values do not appear in the appln_id column of their DOCDB family. For example, for DOCDB_FAMILY_ID 9834 the earliest_filing_id is 5668479, but the appln_id values are 2416995, 14298974, 14298975 and 36709412. For DOCDB_FAMILY_ID 47996 the earliest_filing_id is 903721475, but the appln_id values are 19642068 and 21172081.

However, if I instead use the appln_id with the earliest appln_filing_date in each docdb_family_id, I may end up with more than one appln_id per family. Taking DOCDB_FAMILY_ID 9834 as an example, appln_id 2416995, 14298974, 14298975 and 36709412 were all filed on the same date, 31/10/1991.

Could you please suggest how to identify the first application of each DOCDB_FAMILY_ID, and how to restrict my sample? Do I need to exclude applications whose appln_id (or earliest_filing_id) is larger than 900000000?

Thanks in advance.

Re: the first application of each DOCDB_FAMILY_ID

Post by EPO / PATSTAT Support » Mon May 25, 2020 9:12 am

Hello Darwin,
For the majority of DocDB families, the earliest_filing_id refers to one and the same application for all family members. In the complete PATSTAT database (2020a) there are 185.919 out of 76.458.745 (0.2%) DocDB families that have more than one earliest_filing_id in the family. The reason for this is the rather complex algorithm used to group patent applications into the same DocDB family, combined with the fact that US "continuations" often have application dates separated by much more than the 12 months that are customary under the Paris Convention priority rules. The most extreme case is docdb_family_id = 21978554, which has 128 earliest_filing_ids. If those cases play havoc with your analysis, I would simply take the earliest_filing_id with the earliest filing date and consider that one the "general earliest" for those exceptional families that have more than one earliest_filing_id.
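The "general earliest" rule suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made-up sample data, not real PATSTAT records; each row is assumed to carry the filing date of the application that its earliest_filing_id points to (called earliest_filing_date here); and breaking ties by the smaller ID is an arbitrary choice of this sketch, not something the reply prescribes.

```python
from datetime import date

# Hypothetical sample rows, not real PATSTAT data:
# (appln_id, docdb_family_id, earliest_filing_id, earliest_filing_date)
rows = [
    (101, 1, 101, date(1991, 10, 31)),
    (102, 1, 101, date(1991, 10, 31)),
    (201, 2, 201, date(1985, 3, 1)),   # family 2 has two different
    (202, 2, 205, date(1983, 6, 15)),  # earliest_filing_id values
    (203, 2, 205, date(1983, 6, 15)),
]


def general_earliest(rows):
    """Map each docdb_family_id to a single earliest_filing_id.

    The ID with the earliest filing date wins; if two IDs share that
    date, the smaller ID is kept (an assumption made for this sketch).
    """
    best = {}
    for _appln_id, family_id, filing_id, filing_date in rows:
        candidate = (filing_date, filing_id)
        if family_id not in best or candidate < best[family_id]:
            best[family_id] = candidate
    return {family: filing_id for family, (_date, filing_id) in best.items()}


family_first = general_earliest(rows)
print(family_first)
```

With real data, the same grouping could be done directly in SQL or pandas; the point is only that every family ends up with exactly one first filing.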
You wrote: "For DOCDB_FAMILY_ID 47996, the earliest_filing_id is 903721475".
This is the case in PATSTAT 2018b, but in PATSTAT 2020a you will find that the earliest_filing_id for DOCDB_FAMILY_ID 47996 is 41409583 (which is itself a data error, and I have reported it for correction).
An appln_id (or earliest_filing_id) > 900.000.000 is a reference to a patent filing for which the EPO has not received the data or, commonly, to a priority filing that was withdrawn before publication and will therefore never be published. The PATSTAT data catalog provides detailed information under "Application replenishment".
See Documentation: ... tstat.html
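If you want to treat these replenished filings separately in your sample, a minimal Python sketch could look like the following. The threshold comes from the explanation above; the function name and the sample IDs (taken from this thread) are only for illustration, and whether to exclude such filings at all depends on your research question.

```python
# Threshold taken from the explanation above: IDs above it refer to
# filings for which the EPO holds no data.
ARTIFICIAL_ID_THRESHOLD = 900_000_000


def is_replenished(appln_id):
    """Return True if the ID lies in the range above 900.000.000."""
    return appln_id > ARTIFICIAL_ID_THRESHOLD


# earliest_filing_id values mentioned in this thread, used as sample input.
earliest_filing_ids = [5668479, 903721475, 41409583]

kept = [i for i in earliest_filing_ids if not is_replenished(i)]
flagged = [i for i in earliest_filing_ids if is_replenished(i)]
print(kept, flagged)
```

Keeping the flagged IDs in a separate list, rather than dropping them silently, makes it easy to report how many families are affected.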
PATSTAT Support Team
EPO - Vienna
patstat @

