Understanding and dealing with patent families


Post by Damian_TR » Mon Dec 19, 2022 2:49 pm

Dear Forum, I am interested in studying green and non-green patents and counting citations (excluding self-citations) per firm-year. I first used OECD REGPAT patent application data in Stata (I am not an SQL user). However, I realized that using only EPO patents leaves too many patents outside the study. That said, the problem with using more than one patent authority is the double counting of the same invention patented (with different appln_ids) in several offices.
So, after reading about it, I see that the patent family (the protection of one invention, independent of how many offices it is protected in) can be used. However, there are several issues I do not yet understand.

i. The question is whether these families only capture inventions protected in more than one office, or whether they also include inventions protected in a single office. Imagine an invention is only patented in ES. Does this patent appear in the patent family data?
ii. What if a patent is transferred to another firm? For example, firm A patents its invention in ES and then at the EPO. Thereafter it transfers the patent to firm B, which patents that invention at the USPTO. How is this situation captured by patent family data?
iii. Is the publication date of the family the earliest publication date (in the example in point ii, would it be the one from ES)?
iv. The citations are from family to family: do they include the citations from all the offices (in this example ES, EPO, USPTO) without double counting? I can imagine that in example ii the USPTO document might cite a patent not cited before. Is this newly cited patent included in the family-to-family citations? Or do they include only the citations coming from the earliest patent document (here ES)?
v. Is the firm reported the applicant in the earliest publication (here ES)?
vi. I understand that the cited patents are also family citations, and thus they give the earliest publication date, and the cited applicant is the one in the earliest publication, right?

Let's say I need to know: the citing and cited firms' IDs (HAN_ID), the names of the firms, country(ies), publication date (for counting the citations), the citations (I assume family citations), kind code, technological classifications (IPC and CPC), the publication authority, and whether the invention has been transferred to another firm. All this for a family-to-family dataset covering any authority (EPO, USPTO, PCT, ES, IT, ...).
Is this possible to get? Does working with patent families mean there won't be double counting even when several patent authorities are included?
And finally, how can all this be obtained in SQL? Is there a specific family ID that allows me to merge the different tables?
Any help will be much appreciated!



Post by EPO / PATSTAT Support » Fri Jan 13, 2023 6:08 pm

Hello Damian,

on i) Each patent application belongs to exactly one DocDB (and one INPADOC) patent family, and that includes patents that have no other family members. A single patent filed in ES will simply have no other patent family members.
The docdb_family_size attribute gives a quick and easy way to identify those "single patent inventions". Below is the SQL.

Code:

SELECT *
FROM tls201_appln
WHERE appln_auth = 'ES'
  AND docdb_family_size = 1
  AND ipr_type = 'PI'
  AND appln_id < 900000000
  AND appln_filing_year < 9999
ORDER BY appln_filing_date DESC

ii) The transfer of a patent has no effect on the family picture or the number of patent family members. It is still an application that was filed with the intention of protecting an invention in a certain country; who owns that patent is irrelevant. A transfer can, however, have an effect when "counting patents per applicant", if a later publication with the name of the new owner took place after the transfer. These cases are rather exceptional. Keep in mind that PATSTAT is based on PUBLISHED PATENT DOCUMENTS: a transfer of ownership will normally not result in the publication of a new document.
iii) There is no fixed date linked to a family. Family members can be applied for and published on different dates. But in most cases a family with several applications will have a priority filing, and researchers very often use the "earliest priority date" as a proxy for the "family date". This data is pre-aggregated during PATSTAT production and stored in the attributes [earliest_filing_date] and [earliest_filing_year], which are extracted from the application stored in [earliest_filing_id].
iv) At the lowest level, citations are stored at publication level: a publication cites another publication. The family-to-family citations table ([tls228_docdb_fam_citn]) and the attribute [nb_citing_docdb_fam] are pre-aggregated: they look at all the citations at publication level, then de-duplicate and count them at family level. Keep in mind that there are many different kinds of citations (see data catalog https://documents.epo.org/projects/baby ... _19_en.pdf). For most researchers, these pre-calculated forward citations are too simple and imprecise (they also include the citations given by the applicant --> bias), so researchers will normally use the [tls212_citation] data to develop more fine-tuned approaches, for example by taking only citations that were assigned by examiners during search and examination.
v) No, the applicant name is the name on the published patent document. In most cases, the applicant of the earliest application will also be the applicant of the later applications. But changes can happen. One sometimes sees that large companies have a local subsidiary recorded as the applicant in a certain country (tax reasons, costs, ...).
vi) No, [tls212_citation] gives the detailed information at publication level, not for the earliest publication of the family. Examiners tend to cite patents they can read themselves and will not refer to an earlier application or publication of a family member.
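
For point iv), the de-duplication from publication level to family level could look roughly like the sketch below. It is only an illustration: it uses Python's built-in sqlite3 with a few made-up rows in toy tables that borrow the PATSTAT table and column names, so it can be run without a PATSTAT installation. The IDs, the choice to drop only citn_origin = 'APP' and the exclusion of citations within the same family are assumptions you may want to change.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls201_appln (appln_id INTEGER PRIMARY KEY, appln_auth TEXT, docdb_family_id INTEGER);
CREATE TABLE tls211_pat_publn (pat_publn_id INTEGER PRIMARY KEY, appln_id INTEGER);
CREATE TABLE tls212_citation (pat_publn_id INTEGER, cited_pat_publn_id INTEGER, citn_origin TEXT);

-- Made-up data: family 100 has an ES and a US member, family 400 an EP and a US member.
INSERT INTO tls201_appln VALUES
  (1, 'ES', 100), (2, 'US', 100), (3, 'EP', 200), (4, 'US', 300), (5, 'EP', 400), (6, 'US', 400);
INSERT INTO tls211_pat_publn VALUES (11, 1), (12, 2), (31, 3), (41, 4), (51, 5), (61, 6);
INSERT INTO tls212_citation VALUES
  (51, 11, 'SEA'),  -- EP member of family 400 cites the ES member of family 100
  (61, 12, 'SEA'),  -- US member of family 400 cites the US member of family 100 (same family pair)
  (61, 31, 'APP'),  -- citation given by the applicant, filtered out below
  (61, 41, 'EXA');  -- examiner citation to family 300
""")

# Map both sides of each publication-level citation to its DOCDB family,
# then keep every (citing family, cited family) pair only once.
FAMILY_CITATIONS = """
SELECT DISTINCT citing_app.docdb_family_id AS citing_family,
       cited_app.docdb_family_id AS cited_family
FROM tls212_citation AS c
JOIN tls211_pat_publn AS citing_pub ON c.pat_publn_id = citing_pub.pat_publn_id
JOIN tls201_appln AS citing_app ON citing_pub.appln_id = citing_app.appln_id
JOIN tls211_pat_publn AS cited_pub ON c.cited_pat_publn_id = cited_pub.pat_publn_id
JOIN tls201_appln AS cited_app ON cited_pub.appln_id = cited_app.appln_id
WHERE c.citn_origin <> 'APP'
  AND c.cited_pat_publn_id <> 0
  AND citing_app.docdb_family_id <> cited_app.docdb_family_id
ORDER BY citing_family, cited_family
"""

family_pairs = conn.execute(FAMILY_CITATIONS).fetchall()

# Forward citations per cited family, counted in citing families.
forward_citations = Counter(cited for _citing, cited in family_pairs)

print(family_pairs)
print(dict(forward_citations))
```

In this toy data two publications of family 400 cite two publications of family 100, which ends up as a single family-to-family citation, and the applicant citation to family 200 is not counted at all.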

On your last paragraph: yes, that is possible, but if you also need IPC, CPC, transfers, etc. in the same "table", it becomes complex and practically impossible to extract such a data set in one single SQL query. It calls for intermediate tables from which you can then extract the data you need (or a database extraction via PATSTAT Online if your data set has fewer than 100,000 applications).
Most tables are linked via the application IDs, and most data is also application-centred. So any analysis at family level will need input from the researcher on how to deal with data that is stored at application or even publication level (such as citations).
Below is a query that illustrates how to create a "who cites who" table that also includes the citation origins. You can adapt it further to your needs.

Code:

SELECT citing_company.han_name AS citingcompany_han,
       citing_app.appln_auth, citing_app.appln_nr, citing_app.docdb_family_id,
       citing_app.appln_filing_date,
       citing_pub.pat_publn_id AS citing_pub_pat_publn_id,
       citing_pub.publn_auth + citing_pub.publn_nr + citing_pub.publn_kind AS citing_publication,
       citing_pub.publn_date AS citing_pub_date,
       citn_origin,
       cited_company.han_name AS citedcompany_han,
       cited_pub.pat_publn_id,
       cited_pub.publn_auth + cited_pub.publn_nr AS cited_publication,
       cited_pub.publn_date AS cited_pub_date,
       cited_app.docdb_family_id AS cited_app_docdb_family_id
FROM tls206_person AS citing_company
JOIN tls207_pers_appln AS tls207_pers_appln_1
  ON citing_company.person_id = tls207_pers_appln_1.person_id
 AND tls207_pers_appln_1.applt_seq_nr > 0 AND tls207_pers_appln_1.invt_seq_nr = 0
 AND citing_company.han_harmonized = 2
JOIN tls211_pat_publn AS citing_pub ON tls207_pers_appln_1.appln_id = citing_pub.appln_id
JOIN tls212_citation ON citing_pub.pat_publn_id = tls212_citation.pat_publn_id
JOIN tls211_pat_publn AS cited_pub
  ON tls212_citation.cited_pat_publn_id = cited_pub.pat_publn_id AND cited_pub.pat_publn_id <> 0
JOIN tls207_pers_appln AS tls207_pers_appln_2 ON cited_pub.appln_id = tls207_pers_appln_2.appln_id
JOIN tls206_person AS cited_company
  ON tls207_pers_appln_2.person_id = cited_company.person_id
 AND tls207_pers_appln_2.applt_seq_nr > 0 AND tls207_pers_appln_2.invt_seq_nr = 0
 AND cited_company.han_harmonized = 2
JOIN tls201_appln AS citing_app ON citing_pub.appln_id = citing_app.appln_id
JOIN tls201_appln AS cited_app ON cited_pub.appln_id = cited_app.appln_id
WHERE citing_company.han_name <> cited_company.han_name -- exclude self citations
  AND citing_app.appln_filing_date = '2000-01-05'
GROUP BY citing_company.han_name, citing_app.appln_auth, citing_app.appln_nr,
         citing_app.docdb_family_id, citing_app.appln_filing_date, citing_pub.pat_publn_id,
         citing_pub.publn_auth + citing_pub.publn_nr + citing_pub.publn_kind, citing_pub.publn_date,
         citn_origin, cited_company.han_name, cited_pub.pat_publn_id,
         cited_pub.publn_auth + cited_pub.publn_nr,
         cited_pub.publn_date, cited_app.docdb_family_id, citing_app.appln_id
ORDER BY citing_app.appln_filing_date, citing_app.appln_id,
         citing_pub.pat_publn_id DESC, cited_pub.pat_publn_id DESC
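
If the goal is a citation count per firm and year, one possible next step is to store the result of a query like the one above in an intermediate table and aggregate it afterwards. The sketch below does this with Python's sqlite3 and made-up rows. The table name who_cites_who, the simplified columns, the firm names and the rule of dating a family-to-family citation by the earliest citing publication are all assumptions for illustration, not something PATSTAT prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical intermediate table holding a simplified "who cites who" result.
CREATE TABLE who_cites_who (
  citingcompany_han TEXT, citing_family INTEGER, citing_pub_date TEXT,
  citn_origin TEXT, citedcompany_han TEXT, cited_family INTEGER);
INSERT INTO who_cites_who VALUES
  ('FIRM A', 400, '2001-03-01', 'SEA', 'FIRM B', 100),
  ('FIRM A', 400, '2001-09-15', 'SEA', 'FIRM B', 100),  -- same family pair, other office
  ('FIRM C', 500, '2001-05-20', 'EXA', 'FIRM B', 100),
  ('FIRM C', 510, '2002-02-02', 'SEA', 'FIRM B', 110),
  ('FIRM A', 400, '2001-09-15', 'APP', 'FIRM D', 300);  -- applicant citation, dropped
""")

# Step 1: collapse to one row per (cited firm, cited family, citing family)
#         and date it by the earliest citing publication (an assumption).
# Step 2: count the citing families per cited firm and year.
FIRM_YEAR_COUNTS = """
WITH family_pairs AS (
  SELECT citedcompany_han, cited_family, citing_family,
         CAST(substr(MIN(citing_pub_date), 1, 4) AS INTEGER) AS citation_year
  FROM who_cites_who
  WHERE citn_origin <> 'APP'
  GROUP BY citedcompany_han, cited_family, citing_family
)
SELECT citedcompany_han, citation_year, COUNT(*) AS citing_families
FROM family_pairs
GROUP BY citedcompany_han, citation_year
ORDER BY citedcompany_han, citation_year
"""

firm_year_counts = conn.execute(FIRM_YEAR_COUNTS).fetchall()
for firm, year, n_families in firm_year_counts:
    print(firm, year, n_families)
```

Counting citing families instead of citing publications is what keeps the same invention filed at several offices from being counted more than once.
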
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org



Post by Damian_TR » Mon Feb 20, 2023 12:08 pm

Dear PATSTAT, thanks a lot for your detailed explanations. Let me kindly ask one more question, if you do not mind.

I have seen that for a given firm (same HAN NAME) the HAN_ID changes, especially when the patents come from different offices (e.g. EP, US, WO, ...). Does this happen systematically throughout PATSTAT? How can I study self-citations between families given this problem? Any suggestions?

Also, some firms' names are the same except for punctuation or differences in capitalisation. I thought that these variables (HAN NAME and HAN ID) were harmonized. How does this happen then?

Thanks in advance for your help!



Post by EPO / PATSTAT Support » Mon Feb 27, 2023 1:47 pm

The HAN names are built on top of a PATSTAT release.
This means that in the PATSTAT data the harmonisation always runs one version behind, and therefore some names might not be harmonised. (Or some names were not picked up by the methodology.)
With any harmonisation process or data, you should always expect that some extra cleaning will be needed.
For example, if you run:

Code:

SELECT *
FROM tls206_person
WHERE han_id = 2125445

You will see a list of all the applicants (persons) that were grouped via the OECD methodology as "NOKIA CORP" in FINLAND. (Some companies file patents in a country under the local name. Nokia Corp with a US address files at the USPTO, while the Finnish entity files in Finland or at the EPO. So there can indeed be a dependency on the country of filing.)
The HAN_ID is not only linked to the applicant name! If you run the query below it should become clear.

Code:

SELECT han_name, han_id, person_ctry_code, han_harmonized,
       COUNT(DISTINCT tls207_pers_appln.appln_id) AS applications
FROM tls206_person
JOIN tls207_pers_appln ON tls206_person.person_id = tls207_pers_appln.person_id
WHERE han_name = 'NOKIA CORP' AND han_id < 100000000
GROUP BY han_name, han_id, han_harmonized, person_ctry_code
ORDER BY applications DESC

As a researcher/analyst, you can still decide to group those variations as being one and the same applicant (even though they might be legally different entities). This would probably make sense when you want to check for self-citations.
In the forum you will find some topics on self-citations.
For more information on the OECD methodology, also see:
OECD HAN Database - August 2022.pdf
See also: patent-publication-research-7663#p20062 (Who is citing who ?)
The WHERE clause AND p_cited.psn_name <> p_citing.psn_name was added to eliminate self-citations, but you can adapt it to look specifically for the self-citations. PSN names were used in this case, but you can change that to HAN (or better, check it using both harmonisation approaches --> "AND ((p_cited.psn_name = p_citing.psn_name) or (p_cited.han_name = p_citing.han_name))").
You might want to exclude the citations given by the applicant itself to eliminate bias (citn_origin <> 'APP').
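
For the extra cleaning of name variants (punctuation, capital letters) before checking for self-citations, a very simple approach could look like the sketch below. The functions normalise_name and is_self_citation are hypothetical helpers written for this example, not part of PATSTAT or the OECD methodology, and the cleaning rules (upper case, punctuation replaced by spaces, repeated spaces collapsed) are only an assumption about what is good enough for your data.

```python
import re


def normalise_name(name: str) -> str:
    """Upper-case a name, replace punctuation by spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return re.sub(r"\s+", " ", cleaned).strip()


def is_self_citation(citing_names, cited_names) -> bool:
    """Treat a citation as a self-citation if any cleaned name appears on both sides.

    Each argument is a list of names for one side, for example the PSN name
    and the HAN name of the applicant, so both harmonisations are checked.
    """
    citing = {normalise_name(n) for n in citing_names}
    cited = {normalise_name(n) for n in cited_names}
    return not citing.isdisjoint(cited)


# Made-up name variants for illustration.
print(normalise_name("Nokia Corp."))
print(is_self_citation(["Nokia Corp.", "NOKIA CORPORATION"], ["NOKIA,  CORP"]))
print(is_self_citation(["NOKIA CORP"], ["SIEMENS AG"]))
```

This will not merge variants such as "N.V." and "NV" or different legal forms of the same group, so some manual checking of the largest applicants will probably still be needed.
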

