stable person_id vs address replenishments


gianluca.tarasconi
Posts: 63
Joined: Mon Nov 09, 2009 8:48 pm
Location: Italy

stable person_id vs address replenishments

Post by gianluca.tarasconi » Fri Nov 29, 2013 2:06 pm

Dear all,

First of all, thank you again for implementing a stable person_id in PATSTAT.
However, I was wondering what will happen when, as promised, you provide more addresses for application authorities other than the EPO: this could mean adding or forking existing person_ids. Could that be a problem for the existing stable person_id, or do you plan to handle it simply by adding new codes in a way that lets them be reconciled with the old ones?

For example, if previously I had
person_id 1 = j. Smith - no address with appln_id 1, 2

then I may have

person_id 1 = j. Smith - no address with appln_id 1
person_id 10000000 = j. Smith - with address with appln_id 2

So Martin's advice was correct: we are dealing with a list of names (not persons) that may later change...


nico.rasters
Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm

Re: stable person_id vs address replenishments

Post by nico.rasters » Fri Dec 06, 2013 9:55 am

Doesn't DOC_STD_NAME_ID solve this?
Using your example, both PERSON_ID 1 and 10000000 would have the same DOC_STD_NAME_ID.
And you're not just dealing with address replenishments. There's also mobility; inventors (or the firms they work for) may relocate.
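
As a rough illustration of this suggestion, the sketch below groups person_ids by their DOC_STD_NAME_ID. Only the column names come from PATSTAT; the rows, the doc_std_name_id value 500 and the address are invented to mirror Gianluca's example.

```python
from collections import defaultdict

# Hypothetical rows: (person_id, doc_std_name_id, person_name, person_address).
# The doc_std_name_id value 500 and the address are made up for illustration.
persons = [
    (1, 500, "j. Smith", ""),
    (10000000, 500, "j. Smith", "1 Example Road"),
]


def group_by_std_name(rows):
    """Map each doc_std_name_id to the sorted list of person_ids that share it."""
    groups = defaultdict(list)
    for person_id, doc_std_name_id, _name, _address in rows:
        groups[doc_std_name_id].append(person_id)
    return {key: sorted(ids) for key, ids in groups.items()}


groups = group_by_std_name(persons)
print(groups)
```

If both records really do carry the same DOC_STD_NAME_ID, an analysis keyed on that column would be unaffected by the split into two person_ids.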

Btw, if John Smith did not have address information for applications 1 and 2, wouldn't these Smiths be treated as different already?
Source field name:
PERSON_NAME and PERSON_ADDRESS and PERSON_CTRY_CODE in PATSTAT. Allocate a surrogate key PERSON_ID for each combination of these fields. Upper case and lower case are considered equal. E.g. James Bond is considered to be the same person name as JAMES BOND.
Only when the full identification - name and address and country - of a person is known the person is possibly combined. If one of the attributes is missing no combination is done. This can lead to cases where you can clearly guess that 2 entries in the table tls206_person are the same entity, however we are unable to match these accurately by computer.
________________________________________
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship
http://www.daigu.nl/


mkracker
Posts: 120
Joined: Wed Sep 04, 2013 6:17 am
Location: Vienna

Re: stable person_id vs address replenishments

Post by mkracker » Tue Dec 10, 2013 10:04 am

The rules for de-duplicating records in table TLS206_PERSON have changed in the Oct 2013 edition.

The only real data which PATSTAT knows about a person are the attributes PERSON_NAME, PERSON_ADDRESS and PERSON_CTRY_CODE. All other attributes are here for technical reasons (PERSON_ID) or are derived from these 3 attributes (DOC_STD_NAME_ID, other harmonized name attributes).

We only regard the 3 attributes PERSON_NAME, PERSON_ADDRESS and PERSON_CTRY_CODE when de-duplicating rows, but the criteria for doing so have changed.

Up to the April 2013 edition, rows were de-duplicated if their 3 attributes were “equal”. Upper/lower case did not matter, but empty fields were regarded as different (see Data Catalog v4.50).

Example in April 2013 database: “Siemens” having missing address or country code

person_id,person_ctry_code,doc_std_name_id,person_name,person_address
34614383,"",15501955,SIEMENS,""
34614384,DE,15501955,SIEMENS,""
34614385,"",15501955,Siemens,""
34614389,DE,15501955,Siemens,""
34614390,DE,15501955,SIEMENS,""


From the Oct 2013 edition onwards, rows are still de-duplicated if their 3 attributes are “equal”, but equality is judged by different criteria. Upper/lower case still does not matter, but 2 empty fields are now regarded as identical (see Data Catalog v5.00).

Example in Oct 2013 database: “Siemens” having missing address or country code

person_id,person_ctry_code,doc_std_name_id,person_name,person_address
7404651,DE,193,Siemens,""
10955297,"",193,Siemens,""
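
The difference between the two rules can be sketched in a few lines of Python. This is only an illustration of the behaviour described in this post, not the EPO's actual procedure: the function names are invented, and keeping the first spelling encountered is an assumption. The input rows are the (person_name, person_address, person_ctry_code) values from the April 2013 Siemens example.

```python
def dedup_key(name, address, country):
    """Case-insensitive comparison key for the three attributes."""
    return (name.upper(), address.upper(), country.upper())


def dedup_until_april_2013(rows):
    """Old rule: merge only complete rows; a row with any empty field stays separate."""
    seen, kept = set(), []
    for name, address, country in rows:
        key = dedup_key(name, address, country)
        complete = all((name, address, country))
        if complete and key in seen:
            continue  # complete duplicate: drop it
        seen.add(key)
        kept.append((name, address, country))
    return kept


def dedup_from_oct_2013(rows):
    """New rule: empty fields compare as equal, so identical triples collapse to one row."""
    seen, kept = set(), []
    for row in rows:
        key = dedup_key(*row)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # assumption: keep the first spelling encountered
    return kept


# (person_name, person_address, person_ctry_code) from the April 2013 Siemens example.
rows = [
    ("SIEMENS", "", ""),
    ("SIEMENS", "", "DE"),
    ("Siemens", "", ""),
    ("Siemens", "", "DE"),
    ("SIEMENS", "", "DE"),
]

old = dedup_until_april_2013(rows)
new = dedup_from_oct_2013(rows)
print(len(old), len(new))  # prints: 5 2
```

Under the old rule all five rows survive because each has an empty address; under the new rule they collapse to one row without a country code and one with DE.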

Reason for the change:
Table TLS206_PERSON and table TLS226_PERSON_ORIG are just tables with strings which represent names. Strictly speaking, one cannot make assumptions about whether the names belong to one or more persons / organisations in real life, due to incomplete information, misspellings, name / address changes, etc. As a consequence, the underlying implicit assumption in April 2013 or earlier, that
  * records with identical data in all 3 fields represent the same person (and therefore these rows are de-duplicated) and
  * records where one or more address fields are empty represent different persons (and therefore these rows are not de-duplicated)
does not seem to be justified.

Since Oct 2013 we have been taking a more conservative approach. All rows which are technically speaking equal (assuming upper / lower case is not significant) are just redundant and are therefore de-duplicated. No assumptions about real-life persons are made in PATSTAT.

To count “different” persons, PATSTAT users can define their own criteria to identify unique persons. They could use names and address information, but also information found in other tables (like their applications or the family members of their applications). Or, as Nico correctly proposed, they can use the harmonized names available, like EPO’s DOCDB standardized names, KU Leuven EEE-PPAT harmonized names or OECD’s HAN harmonized names (for details see the most current Data Catalog).

Gianluca has given a correct example: In case there are name changes, better address information or just new names / addresses, which result in a new triple of the attributes PERSON_NAME, PERSON_ADDRESS and PERSON_CTRY_CODE, there will simply be a new record in table TLS206_PERSON with a new PERSON_ID.
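
A minimal sketch of that behaviour, assuming a simple "known triples keep their id, unseen triples get the next free id" scheme. The function name, the numbering scheme and the address are hypothetical and only mirror Gianluca's example; this is not how the PATSTAT production process is implemented.

```python
def assign_person_ids(existing, triples, next_id):
    """Return an updated mapping: known triples keep their id, new triples get a fresh one.

    existing : dict mapping a case-insensitive (name, address, country) key to person_id
    triples  : iterable of (name, address, country) found in the new edition
    next_id  : first unused person_id (hypothetical numbering scheme)
    """
    ids = dict(existing)
    for name, address, country in triples:
        key = (name.upper(), address.upper(), country.upper())
        if key not in ids:
            ids[key] = next_id
            next_id += 1
    return ids


# Earlier edition: one record without address (values mirror Gianluca's example).
previous = {("J. SMITH", "", ""): 1}

# Later edition: the same name now also appears with an address (address is invented).
current = [("j. Smith", "", ""), ("j. Smith", "1 Example Road", "GB")]

ids = assign_person_ids(previous, current, next_id=10000000)
print(ids)
```

The record without an address keeps person_id 1, while the new triple with an address receives a new person_id, which is the fork Gianluca describes.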

I hope this helps.

Best regards,
Martin Kracker / EPO

