While carrying out some analysis, I noticed an oddity: an application filing date and a publication date that differed by 100 years. I found the PDF of the patent through Espacenet, and on the PDF the two dates were in fact very close together, so the dates in PATSTAT are incorrect. (Examples: appln_id IN (333182097, 41520004))
This got me worried about other possible errors of the same kind, so I investigated further. I first used the query below to get the relevant information for all patents in PATSTAT.
I then parsed the year from each date and added a new column holding the difference between publication year and application year (publn_year - appln_year), which I call `y_diff`. Lastly, I ordered the data by y_diff.
Code:
SELECT tls201_appln.appln_id,
       tls211_pat_publn.pat_publn_id,
       tls201_appln.appln_filing_year,
       tls211_pat_publn.publn_date
FROM tls201_appln
INNER JOIN tls211_pat_publn
        ON tls201_appln.appln_id = tls211_pat_publn.appln_id
ORDER BY appln_filing_year, publn_date
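For anyone wanting to replicate this, here is a minimal sketch of how y_diff could be computed directly in SQL rather than in a separate step. It is not the exact procedure I used: it runs against an in-memory SQLite database with three made-up rows, and `strftime` is SQLite syntax, so the year extraction would need adapting to whatever database hosts your PATSTAT copy. The function name `y_diff_rows` and the toy data are purely illustrative.

```python
import sqlite3


def y_diff_rows(conn):
    """Return (appln_id, pat_publn_id, y_diff) tuples ordered by y_diff."""
    query = """
        SELECT a.appln_id,
               p.pat_publn_id,
               -- publication year minus application filing year
               CAST(strftime('%Y', p.publn_date) AS INTEGER)
                   - a.appln_filing_year AS y_diff
        FROM tls201_appln AS a
        INNER JOIN tls211_pat_publn AS p
                ON a.appln_id = p.appln_id
        ORDER BY y_diff
    """
    return conn.execute(query).fetchall()


# Toy stand-in for the two PATSTAT tables (made-up rows, only the columns used).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tls201_appln (
        appln_id INTEGER PRIMARY KEY,
        appln_filing_year INTEGER
    );
    CREATE TABLE tls211_pat_publn (
        pat_publn_id INTEGER PRIMARY KEY,
        appln_id INTEGER,
        publn_date TEXT
    );
    INSERT INTO tls201_appln VALUES (1, 2005), (2, 2010), (3, 1912);
    INSERT INTO tls211_pat_publn VALUES
        (10, 1, '2006-07-13'),
        (20, 2, '2009-01-15'),
        (30, 3, '2012-03-01');
""")

rows = y_diff_rows(conn)
for appln_id, pat_publn_id, y_diff in rows:
    print(appln_id, pat_publn_id, y_diff)
```

In the toy data, application 2 ends up with a negative y_diff and application 3 with a 100-year gap, mirroring the two kinds of oddity discussed in this post.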
Firstly, there were 5481 entries with a negative y_diff, implying that the patent was published before it was applied for. I'm assuming that this should never be the case* and therefore that all of these entries contain an error. I would also have assumed that automatic sanity checks like this are in place when the data is compiled. Is that not so?
Secondly, I looked at the large positive y_diffs. Taking a very conservative threshold of at least 50 years between application and publication as pointing to a possible error, I found 60 such entries. A less conservative threshold of 10 years or more gives 1140616 entries.
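The counts above can be reproduced from a list of y_diff values with something like the following sketch. The function name `summarise` and the sample values are made up for illustration; the thresholds of 10 and 50 years are the ones used in this post, and the counts are cumulative (an entry with a 50-year gap is also counted under the 10-year threshold).

```python
def summarise(y_diffs, thresholds=(10, 50)):
    """Count negative gaps and gaps at or above each threshold (cumulative)."""
    summary = {"negative": sum(1 for d in y_diffs if d < 0)}
    for t in thresholds:
        summary[f"at_least_{t}"] = sum(1 for d in y_diffs if d >= t)
    return summary


# Made-up y_diff values, not taken from PATSTAT.
sample = [-1, 0, 1, 3, 12, 49, 50, 100]
counts = summarise(sample)
print(counts)
```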
I'd be very happy to share the resultant lists of applications if it's helpful (though the analysis is replicated easily enough).
I'd also be interested in the result of the same analysis on the latest PATSTAT edition.
Thoughts on the analysis, possible ways to mitigate the problem, anything else relevant, all appreciated.
Daniel Burkhardt Cerigo
Santa Fe Institute
*Though I have learnt that the patent system can have all manner of counter-intuitive complexities, so I wouldn't want to say I'm entirely sure.