Application Filing Date/Publication Date Errors


DBCerigo
Posts: 4
Joined: Wed Jan 20, 2016 11:52 pm

Application Filing Date/Publication Date Errors

Post by DBCerigo » Mon Feb 01, 2016 12:13 am

Using the 2015 Spring Edition, and SQLite for initial querying then Python (pandas) for analysis.

While carrying out some analysis, I noticed an oddity between an application's filing date and its publication date: a difference of 100 years. I looked up the patent's PDF through Espacenet, and the two dates on the PDF were in fact very close together, so the dates in PATSTAT are incorrect. (Examples: appln_id IN (333182097, 41520004))

This got me worried about other possible errors of the same kind, so I investigated further. I first used this query to get the relevant information for all patents in PATSTAT.

Code:

SELECT tls201_appln.appln_id, tls211_pat_publn.pat_publn_id, tls201_appln.appln_filing_year, tls211_pat_publn.publn_date
FROM tls201_appln
INNER JOIN tls211_pat_publn 
ON tls201_appln.appln_id = tls211_pat_publn.appln_id
ORDER BY appln_filing_year, publn_date

I then parsed the year from each publication date and added a new column with the difference between publication year and filing year (publn_year - appln_filing_year); call this value `y_diff`. Lastly, I ordered the data by y_diff.
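
For anyone wanting to replicate the y_diff step, here is a minimal sketch that computes it directly in SQLite through Python's built-in sqlite3 module instead of pandas. The in-memory tables and their rows are made-up stand-ins for the real PATSTAT tables, and the sketch assumes publn_date is stored as 'YYYY-MM-DD' text.

```python
import sqlite3

# Toy stand-ins for the two PATSTAT tables; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls201_appln (appln_id INTEGER, appln_filing_year INTEGER);
CREATE TABLE tls211_pat_publn (pat_publn_id INTEGER, appln_id INTEGER, publn_date TEXT);
INSERT INTO tls201_appln VALUES (1, 2001), (2, 1905), (3, 1998);
INSERT INTO tls211_pat_publn VALUES
    (10, 1, '2002-07-04'),
    (20, 2, '2005-03-01'),
    (30, 3, '1997-11-20');
""")

# Assumption: the first four characters of publn_date are the year.
Y_DIFF_SQL = """
SELECT a.appln_id,
       p.pat_publn_id,
       a.appln_filing_year,
       p.publn_date,
       CAST(substr(p.publn_date, 1, 4) AS INTEGER) - a.appln_filing_year AS y_diff
FROM tls201_appln AS a
INNER JOIN tls211_pat_publn AS p ON a.appln_id = p.appln_id
ORDER BY y_diff
"""


def compute_y_diff(connection):
    """Return one row per publication, ordered by publication year minus filing year."""
    return connection.execute(Y_DIFF_SQL).fetchall()


rows = compute_y_diff(conn)
for row in rows:
    print(row)
```

Rows with a negative y_diff sort to the top and very large gaps sort to the bottom, which is the ordering used in the rest of this post.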

Firstly, there were 5481 entries with a negative y_diff, thus implying that the patent was published before it was applied for. I'm assuming that this should never be the case* and thus that all these entries contain an error. I also would have assumed that when the data is being compiled there are automatic sanity checks like this in place?

Secondly, I looked at the large positive y_diffs. Taking a very conservative threshold of at least a 50-year difference between application and publication as pointing to a possible error, I found 60 such entries. A less conservative threshold of 10 years or more gives 1140616 entries.
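
The three counts above (negative, 10 years or more, 50 years or more) can be produced in one pass. This is only a sketch: the function name summarise_y_diffs and the sample values are hypothetical, and the thresholds are simply the ones mentioned in this post.

```python
def summarise_y_diffs(y_diffs, thresholds=(10, 50)):
    """Count negative gaps and gaps at or above each threshold (counts are cumulative)."""
    summary = {"negative": sum(1 for d in y_diffs if d < 0)}
    for t in thresholds:
        # A gap of 50+ years is also counted in the 10+ bucket, as in the post.
        summary[f"at_least_{t}"] = sum(1 for d in y_diffs if d >= t)
    return summary


# Invented sample values, not PATSTAT data.
sample = [-1, 0, 1, 3, 12, 49, 50, 100]
summary = summarise_y_diffs(sample)
print(summary)
```

Note that the buckets overlap on purpose, so the 10-year count always includes the 50-year count.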

I'd be very happy to share the resultant lists of applications if it's helpful (though the analysis is replicated easily enough).
I'd also be interested in the result of the same analysis on the latest PATSTAT edition.

Thoughts on the analysis, possible ways to mitigate the problem, anything else relevant, all appreciated.

Daniel Burkhardt Cerigo
Graduate Researcher
Santa Fe Institute

*Though I have learnt that the patent system can have all manner of counter-intuitive complexities to it, so I wouldn't want to state I'm entirely sure.


Geert Boedt
Posts: 176
Joined: Tue Oct 19, 2004 10:36 am
Location: Vienna

Re: Application Filing Date/Publication Date Errors

Post by Geert Boedt » Mon Feb 01, 2016 4:53 pm

Dear Daniel,
Your observations are correct: many dates look "impossible" when comparing the publication date to the filing date, since one would assume that the filing date always precedes the publication date.

The specific examples you gave (appln_id IN (333182097, 41520004)) are indeed errors. I have reported them through the data error link, and they will be corrected.
The reason for this error is probably the quality of the OCR-ed data. Old (and not-so-old written) documents are not keyed in but OCR-ed; in this case that led to the difference of 100+ years. This situation occurs quite often with older documents, and although we can now easily trace them, it is purely manual work to look up the correct date in the PDF (as you did). Another reason for errors is the data quality at source. We receive data from over 100 countries, in many different file formats. Checks are in place, but sometimes data formats are changed before we know it. As a result, wrong data enters the database (but it is mostly corrected afterwards when the error is systematic).
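
As an illustration of how such cases could be traced, the sketch below flags filing/publication year pairs whose gap is close to 100 years, which is the pattern a misread century digit would produce. This heuristic, the function name and the tolerance of 5 years are assumptions made for the example, not an EPO procedure; a flagged record is only a candidate for checking against the PDF.

```python
def looks_like_century_slip(filing_year, publn_year, tolerance=5):
    """Return True when the two years are roughly 100 years apart in either direction.

    Hypothetical heuristic: a gap near 100 years suggests one of the century
    digits may have been misread, but it proves nothing on its own.
    """
    gap = abs(publn_year - filing_year)
    return abs(gap - 100) <= tolerance


# Invented year pairs, not taken from PATSTAT.
candidates = [(1905, 2005), (1998, 2001), (2004, 1906), (1950, 2010)]
flagged = [pair for pair in candidates if looks_like_century_slip(*pair)]
print(flagged)
```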
The EPO does not have the resources to do manual corrections on large batches of records, and therefore we limit corrections of older documents to the "important" ones: recent documents, priority dates that affect families, priority dates for retrieved prior art, reported cases ...
I did a quick analysis with the SQL below. I only took into consideration applications or publications filed after 1970, which I think is statistically relevant, and then I removed dummy applications through restrictions on the kind code and the appln_id. This resulted in about 6000 publications.
I then had a look at the distribution over the various countries and publication kinds. It looks evenly and randomly distributed; the only one that really jumps out is Norway, with 3800 publications. We will have a closer look at that.

SELECT
    publn_auth, publn_kind, COUNT(pat_publn_id)
/*
    tls201_appln.appln_id, tls201_appln.appln_nr_epodoc,
    tls201_appln.appln_filing_date, publn_auth, publn_nr, publn_kind, publn_date,
    DATEDIFF(day, tls211_pat_publn.publn_date, tls201_appln.appln_filing_date) days_dif
*/
FROM tls201_appln
INNER JOIN tls211_pat_publn
    ON tls201_appln.appln_id = tls211_pat_publn.appln_id
WHERE tls211_pat_publn.publn_date < tls201_appln.appln_filing_date
    AND tls201_appln.appln_id < 900000000 -- excluding artificial applications, many having the year 9999
    AND tls201_appln.appln_filing_year <> 9999
    AND appln_kind IN ('A','U','F','P','W','T')
    AND (appln_filing_date > '1975-01-01' OR publn_date > '1975-01-01')
GROUP BY publn_auth, publn_kind
ORDER BY publn_auth, publn_kind
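
Since the original question was run on SQLite, here is a sketch of the same filter executed through Python's built-in sqlite3 module. The toy tables and rows are invented for illustration, and the sketch assumes dates are stored as 'YYYY-MM-DD' text so that plain string comparison orders them correctly. The commented-out DATEDIFF(day, ...) call in the query above is SQL Server syntax; in SQLite a day difference can be obtained with julianday(later_date) - julianday(earlier_date).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls201_appln (
    appln_id INTEGER, appln_kind TEXT,
    appln_filing_date TEXT, appln_filing_year INTEGER);
CREATE TABLE tls211_pat_publn (
    pat_publn_id INTEGER, appln_id INTEGER,
    publn_auth TEXT, publn_kind TEXT, publn_date TEXT);

-- Invented rows: ids 1-3 are published before filing, 4 is normal,
-- 900000001 mimics an artificial application, 5 has a kind outside the list.
INSERT INTO tls201_appln VALUES
    (1, 'A', '1990-05-01', 1990),
    (2, 'A', '1991-02-10', 1991),
    (3, 'A', '1985-06-30', 1985),
    (4, 'A', '2000-01-15', 2000),
    (900000001, 'A', '9999-12-31', 9999),
    (5, 'X', '1995-03-03', 1995);
INSERT INTO tls211_pat_publn VALUES
    (10, 1, 'NO', 'B', '1989-11-20'),
    (20, 2, 'NO', 'B', '1990-12-01'),
    (30, 3, 'US', 'A', '1984-01-05'),
    (40, 4, 'EP', 'A1', '2001-07-18'),
    (50, 900000001, 'EP', 'A1', '1999-01-01'),
    (60, 5, 'DE', 'U1', '1994-01-01');
""")

PUBLISHED_BEFORE_FILING_SQL = """
SELECT p.publn_auth, p.publn_kind, COUNT(p.pat_publn_id) AS n_publn
FROM tls201_appln AS a
INNER JOIN tls211_pat_publn AS p ON a.appln_id = p.appln_id
WHERE p.publn_date < a.appln_filing_date
  AND a.appln_id < 900000000
  AND a.appln_filing_year <> 9999
  AND a.appln_kind IN ('A', 'U', 'F', 'P', 'W', 'T')
  AND (a.appln_filing_date > '1975-01-01' OR p.publn_date > '1975-01-01')
GROUP BY p.publn_auth, p.publn_kind
ORDER BY p.publn_auth, p.publn_kind
"""

counts = conn.execute(PUBLISHED_BEFORE_FILING_SQL).fetchall()
for publn_auth, publn_kind, n_publn in counts:
    print(publn_auth, publn_kind, n_publn)
```

On the toy data only the NO and US rows survive the filters; the artificial application and the row with the unlisted kind code are excluded even though their publication dates precede their filing dates.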
Attachments
filing_pub_date.xlsx
(14.76 KiB)
Best regards,

Geert Boedt
PATSTAT support
Business Use of Patent Information
EPO Vienna

