Misspelled applicants in patstat


Florian33
Posts: 2
Joined: Tue Apr 19, 2016 8:54 am


Post by Florian33 » Tue Apr 19, 2016 9:06 am

Dear all,

I'm trying to match PATSTAT applicants to an external dataset of firms, using a fuzzy matching procedure on their names. In addition to name variants (e.g. names ending with Corp., Ltd., etc.), I also find a lot of misspelled names in the hrm_l2 variable. For example, fuzzy matching of "SMITHKLINE BEECHAM" gives me "SMITHKLING BEECHAM", "SMITHEKLINE BEECHAM" and "SMITHKLINE BECHAM", among many others, each with a different hrm_l2_id. han_name seems to be more harmonized, but has essentially the same problem.

I'm assuming all of these entries refer to the same company and I would like to add them up. What would be the best way to identify all the patent applications by presumably identical, but misspelled applicants? Is there some unique id variable that I'm missing?

Thanks & best regards,
Florian
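To illustrate the kind of fuzzy matching described in this post, here is a minimal sketch using only Python's standard library. It is not the procedure Florian actually used, which the post does not specify. The `LEGAL_FORMS` list and the helper names `normalize` and `similarity` are assumptions made up for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately short list of legal-form tokens; extend as needed.
LEGAL_FORMS = {"CORP", "CORPORATION", "LTD", "LIMITED", "INC", "PLC", "GMBH", "AG", "CO"}


def normalize(name: str) -> str:
    """Uppercase, replace punctuation with spaces, strip trailing legal-form tokens."""
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", name.upper()).split()
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Name variants collapse to the same string after normalization ...
print(normalize("SmithKline Beecham Corp."))
# ... while misspellings stay distinct but score close to 1.
print(similarity("SMITHKLINE BEECHAM", "SMITHKLING BEECHAM"))
```

Normalization handles the Corp./Ltd. variants, but the misspellings still need a similarity threshold, and any threshold will produce some false matches between genuinely different firms.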


Geert Boedt
Posts: 176
Joined: Tue Oct 19, 2004 10:36 am
Location: Vienna

Re: Misspelled applicants in patstat

Post by Geert Boedt » Tue Apr 19, 2016 11:14 am

Hello Florian,
no, you are not missing anything, and there is no silver bullet to solve the problem. The HAN approach differs from the approach developed by the Leuven ECOOM team. Both approaches have their advantages, but neither is perfect for every situation or use. (The approaches are described in the respective documentation.) The perfect method would probably be to check and match all names manually, but even then it would sometimes not be possible to match names. For statistical purposes, both methods will lead to greatly improved results, but if you want 100% coverage (according to your own criteria), then you will need to clean the data manually yourself.
Your examples are obviously misspellings and could probably be grouped, assuming they have the same address and country, but an "automatic algorithm" approach could not do this on the complete person table. If you find larger batches of names that should be linked, kindly provide me with the list of person_id values (which are fixed for all PATSTAT releases), and we will try to implement it in the "manual" matching part of the methodology.
Best regards,

Geert Boedt
PATSTAT support
Business Use of Patent Information
EPO Vienna
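The grouping idea in the reply above (similar names that share a country) could be sketched roughly as follows. This is an illustration under stated assumptions, not the HAN or ECOOM methodology: the records are made-up `(person_id, name, country)` tuples with invented person_id values, the function name `group_applicants` and the 0.9 threshold are arbitrary choices, and the address check mentioned in the reply is left out for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def group_applicants(records, threshold=0.9):
    """Cluster (person_id, name, country) records whose names are similar
    within the same country. Returns a sorted list of sorted person_id lists."""
    parent = {pid: pid for pid, _, _ in records}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only compare names inside the same country (a simple blocking key).
    by_country = defaultdict(list)
    for pid, name, country in records:
        by_country[country].append((pid, name.upper()))

    for members in by_country.values():
        for (pid_a, name_a), (pid_b, name_b) in combinations(members, 2):
            if SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
                parent[find(pid_a)] = find(pid_b)

    groups = defaultdict(list)
    for pid in parent:
        groups[find(pid)].append(pid)
    return sorted(sorted(g) for g in groups.values())


# Illustrative data only: the person_id values and countries are invented.
records = [
    (1, "SMITHKLINE BEECHAM", "GB"),
    (2, "SMITHKLING BEECHAM", "GB"),
    (3, "SMITHEKLINE BEECHAM", "GB"),
    (4, "SMITHKLINE BECHAM", "GB"),
    (5, "SMITHKLINE BEECHAM", "US"),
    (6, "SIEMENS", "GB"),
]
groups = group_applicants(records)
print(groups)
```

The pairwise comparison is quadratic within each country, so a sketch like this only suits small candidate sets, not the complete person table. Each resulting list of person_id values would still need a manual check before being treated as one applicant.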



Post by Florian33 » Tue Apr 19, 2016 12:42 pm

Hello Geert,

Thanks for your swift and helpful reply. I'll try grouping by location data in conjunction with names, and I'll let you know if I get any useful groupings.

Best,
Florian

