Dear all,
I'm trying to match PATSTAT applicants to an external dataset of firms, using a fuzzy matching procedure on their names. In addition to name variants (e.g. names ending in Corp., Ltd, etc.), I also find many misspelled matches in the hrm_l2 variable. For example, fuzzy matching on "SMITHKLINE BEECHAM" returns "SMITHKLING BEECHAM", "SMITHEKLINE BEECHAM" and "SMITHKLINE BECHAM", among many others, all of which have different hrm_l2_id values. han_name seems to be more harmonized, but has essentially the same problem.
I'm assuming all of these entries refer to the same company and I would like to add them up. What would be the best way to identify all the patent applications by presumably identical, but misspelled applicants? Is there some unique id variable that I'm missing?
Thanks & best regards,
Florian
Misspelled applicants in patstat
Re: Misspelled applicants in patstat
Hello Florian,
No, you are not missing anything, and there is no silver bullet for this problem. The HAN approach differs from the approach developed by the Leuven ECOOM team. Both approaches have their advantages, but neither is perfect for every situation or use. (Each approach is described in its respective documentation.) The ideal method would probably be to check and match all names manually, but even then some names could not be matched. For statistical purposes, both methods lead to greatly improved results, but if you want 100% coverage (according to your own criteria), you will need to clean the data manually yourself.
The examples are obviously misspellings and could probably be grouped, assuming they have the same address and country, but a purely automatic algorithm could not do this reliably on the complete person table. If you find larger batches of names that should be linked, kindly provide me with the list of person_id values (which are fixed for all PATSTAT releases), and we will try to implement it in the "manual" matching part of the methodology.
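As an illustration of the grouping idea above, here is a minimal sketch in Python. It is not part of the HAN or ECOOM methodology. It assumes you have already extracted (person_id, name, country code) tuples from the person table; the function name, the tuple layout and the 0.9 similarity threshold are all hypothetical choices. Names are only compared within the same country, and linked names are merged into groups whose person_id values can then be reviewed manually.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def group_similar_applicants(records, threshold=0.9):
    """Group person_ids whose names are near-identical within one country.

    records: iterable of (person_id, name, country_code) tuples
             (a hypothetical layout, not a PATSTAT API).
    Returns a sorted list of sorted person_id lists, one per group
    that contains more than one member.
    """
    records = list(records)
    # Union-find structure: every person_id starts as its own group.
    parent = {pid: pid for pid, _, _ in records}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Blocking step: only names from the same country are compared.
    by_country = defaultdict(list)
    for pid, name, country in records:
        by_country[country].append((pid, name.upper().strip()))

    for members in by_country.values():
        for (pid_a, name_a), (pid_b, name_b) in combinations(members, 2):
            similarity = SequenceMatcher(None, name_a, name_b).ratio()
            if similarity >= threshold:
                # Merge the two groups.
                parent[find(pid_a)] = find(pid_b)

    groups = defaultdict(list)
    for pid in parent:
        groups[find(pid)].append(pid)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

The comparison is pairwise inside each country block, so the cost grows quadratically with block size; for large countries you would need a finer blocking key (for example city or a name prefix). A high threshold keeps false links rare, but the resulting groups should still be checked by hand before aggregating patent counts.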
Best regards,
Geert Boedt
PATSTAT support
Business Use of Patent Information
EPO Vienna
Re: Misspelled applicants in patstat
Hello Geert,
thanks for your swift and helpful reply. I'll give the approach of exploiting location data in conjunction with names for grouping a try and I'll let you know if I get any useful groupings.
Best,
Florian