Citation-analysis issues (family-family citations)

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

DavidGr
Posts: 2
Joined: Thu Apr 14, 2016 1:46 pm

Citation-analysis issues (family-family citations)

Post by DavidGr » Thu Apr 14, 2016 2:28 pm

Hey everyone!

I am a master student currently working on his semester thesis and was interested in using PATSTAT online (version of autumn 2015) to trace back the learning loci in the Lithium-ion battery (LIB) industry. More precisely, I want to perform a forward-citation analysis on a dataset of patents judged relevant:

I am a bit stuck at some point with the use of PATSTAT to finish up the data collection part. Do you think you could help me with that? If this is the case, in order to help you understand my questions, here a summary of my current situation:

I managed without problem to retrieve the patents of interest from the EPO database, but I have some issues performing the citation analysis. More especially, as the on-line version is less potent than the off-line version (especially in terms of intermediate creation of tables), I meet the following issues:
  • (1) I adapted the query from (de Rassenfosse, Dernis, and Boedt 2014) in order to perform a forward citation-analysis on the selected patents (see my query) with a time window of 5 years. My problem now is to further adapt this query to do a family-family citations (as at the moment I am comparing patent applications and expose myself to double counting if I understood well). For this, I had two ideas:
  • (A) Directly extracting the dataset of the relevant patents in terms of LIBs and then using MS access to manually flag the family.
  • (B) Or applying a sub-query directly into the SQL query to restrain the result to intra-family citations.
  • (2) Once I have my final dataset with the relevant patents from the LIBs industry and the counting of forward citations for each of them, I would have like to enhance this database by:
  • •Collecting in a new database all the family_id of the backward AND forward citations of my initial dataset and apply the same key-word search string on them (in order to decrease the number of false negative I have due to the constraint on the technology code). Do you think such a manipulation is possible with PATSTAT online? If yes, is it possible you have some advice concerning such a query?
Thank you for your time.

Cheers,
David

Attached: my query so far:
SELECT
DISTINCT t0.inpadoc_family_id, COUNT(DISTINCT t7.pat_publn_id), t0.appln_id, t6.earliest_date, t8.pat_publn_id, t8.publn_date
FROM
tls201_appln t0
INNER JOIN
(SELECT
t1.appln_id
FROM
tls201_appln t1
INNER JOIN
tls209_appln_ipc t2 ON t1.appln_id = t2.appln_id
INNER JOIN
tls224_appln_cpc t3 ON t2.appln_id = t3.appln_id
INNER JOIN
tls202_appln_title t4 ON t3.appln_id = t4.appln_id
INNER JOIN
tls203_appln_abstr t5 ON t4.appln_id = t5.appln_id
WHERE
YEAR(t1.appln_filing_date) BETWEEN 1990 AND 2010
AND (t1.appln_kind = 'A' OR t1.appln_kind = 'W')

AND (t2.ipc_class_symbol LIKE 'H01M 10/052')
-- other search strings which takes time to be computed
GROUP BY t1.appln_id) ta ON t0.appln_id = ta.appln_id
INNER JOIN
(SELECT
appln_id, MIN(publn_date) AS earliest_date
FROM
tls211_pat_publn
GROUP BY appln_id) t6 ON t0.appln_id = t6.appln_id
INNER JOIN
tls211_pat_publn t6b ON t6b.appln_id = t6.appln_id
INNER JOIN
tls212_citation t7 ON t6b.pat_publn_id = t7.cited_pat_publn_id
INNER JOIN
tls211_pat_publn t8 ON t7.pat_publn_id = t8.pat_publn_id
WHERE
YEAR(t6.earliest_date) != 9999
AND YEAR(t8.publn_date) != 9999
AND DATEDIFF(YEAR, t6.earliest_date, t8.publn_date) <= 5
GROUP BY t0.inpadoc_family_id, t0.appln_id, t6.earliest_date, t8.pat_publn_id, t8.publn_date
ORDER BY COUNT(DISTINCT t7.pat_publn_id) DESC, t0.appln_id ASC;


Geert Boedt
Posts: 176
Joined: Tue Oct 19, 2004 10:36 am
Location: Vienna

Re: Citation-analysis issues (family-family citations)

Post by Geert Boedt » Tue Apr 19, 2016 3:14 pm

Hello David,
you are right about the limitations in PATSTAT Online. The best way for you to proceed with such rather advanced citation analysis is to extract a PATSTAT subset via the download options, and then continue your analysis on your own local computer system.

Specific for your query, here is an example to extract the data (based on your sample):

Code: Select all

Select distinct t1.appln_id from
tls201_appln t1
        LEFT JOIN
tls209_appln_ipc t2 ON t1.appln_id = t2.appln_id
		LEFT JOIN
tls224_appln_cpc t3 ON t2.appln_id = t3.appln_id
WHERE
	YEAR(t1.appln_filing_date) BETWEEN 1990 AND 2010
    AND (t1.appln_kind = 'A' OR t1.appln_kind = 'W')
	AND   (t2.ipc_class_symbol LIKE 'H01M  10/052%'  
        OR t2.ipc_class_symbol LIKE 'H01M   4/13%' 
		OR t3.cpc_class_symbol LIKE 'H01M  10/052%'  
        OR t3.cpc_class_symbol LIKE 'H01M   4/13%'
       	OR t3.cpc_class_symbol LIKE 'Y02E  60/122' 
		OR t3.cpc_class_symbol LIKE 'Y02T  10/7011'
		)
Small observation, I replaced the INNER JOINS by LEFT JOINS, because INNER JOINS will limit the result to applications that have effectively an IPC as well as CPC and title and abstract which I don't think is what you want. Also: the % wild card should be used with LIKE, so I adapted it a bit (but did not check it for correctness with regards to the chosen classifications.)
Use of keywords in title and abstract is mostly only used for very specific searches that can not be done with the classifications; in my opinion this is not need for this technology.
The above query will give you 68.398 distinct applications which is less then the maximum of 100.000 that can be used to create a PATSTAT subset. In order to create a subset, you need to have the appln_id in the SELECT clause. All other attributes are not needed for the download, but it does not harm neither.
Select the TAB download, and then take the option PATSTAT subset. Select the format (CSV or ACCESS) depending on the tools you have available.
download.png
download.png (190.02 KiB) Viewed 4978 times
You can select the tables you need (take all, just to avoid missing anything),and then extend the selection to include the DOCDB family members and the Cited and Citing application. (see screen shot) Your original sample of 68.398 patent patent applications will now be extended by 8.399 family members and 109.436 citations.
You can download the data set as a zipped file and do any further analysis on your own computer system. To count family-family citations, I assume the easiest is to create an extra table that contains the family pairs (based on the 5 year window difference on the publication date), and then simply de-duplicate them (count distinct pairs). This is the same as what is explained in the publication: http://documents.epo.org/projects/babyl ... ies_en.pdf You will observe that the ACCESS extraction has an extra table tls200_appln_origin. This table indicates whether an application is part of the "original" query, or was added through the family & citations extension at download. You will need to use "original or family" as your prime set to count the citations.
DocDB family_id's are available at application level in the tls201_appln table, so you need to join tls201_appln with tls211_pat_publn and then further with the cited publications (alias from tls211) via tls212_citation. You also might want to exclude the self citations, or limit the count the citations given by examiners (citn_orgin = 'SEA')

Code: Select all

SELECT tls201_appln.docdb_family_id, cited_application.docdb_family_id, tls211_pat_publn.publn_date, tls211_pat_publn_1.publn_date, DateDiff("yyyy",tls211_pat_publn_1.publn_date,tls211_pat_publn.publn_date) AS date_dif
FROM ((((tls200_appln_origin INNER JOIN tls201_appln ON tls200_appln_origin.appln_id = tls201_appln.appln_id)
 INNER JOIN tls211_pat_publn ON tls201_appln.appln_id = tls211_pat_publn.appln_id) 
INNER JOIN tls212_citation ON tls211_pat_publn.pat_publn_id = tls212_citation.pat_publn_id) 
INNER JOIN tls211_pat_publn AS tls211_pat_publn_1 ON tls212_citation.cited_pat_publn_id = tls211_pat_publn_1.pat_publn_id)
 INNER JOIN tls201_appln AS cited_application ON tls211_pat_publn_1.appln_id = cited_application.appln_id
WHERE (((tls211_pat_publn_1.publn_date)<>#12/31/9999#) 
AND ((tls200_appln_origin.origin)="original" Or (tls200_appln_origin.origin)="family") 
AND ((tls212_citation.citn_origin)="SEA") 
AND ((DateDiff("yyyy",[tls211_pat_publn_1].[publn_date],[tls211_pat_publn].[publn_date]))<=5));
Best regards,

Geert Boedt
PATSTAT support
Business Use of Patent Information
EPO Vienna


DavidGr
Posts: 2
Joined: Thu Apr 14, 2016 1:46 pm

Re: Citation-analysis issues (family-family citations)

Post by DavidGr » Fri Apr 22, 2016 9:39 am

Hey Geert,

Thank you very much for this detailed answer. It helped me a lot! I would however have three more questions in order to fully understand your answer. (n.b. I used your method for my forward analysis but this time I did not limit myself only to intra-family citations).

(1) Firstly, if I am limiting myself to the ‘SEA’ origin for my citation analysis, am I not decreasing the importance of the American and Asian patents as – at least in the US – the applicants have this “duty of candour” which forces them to cite (sometimes over-cite) during the application phase (and then the number of citations added by the examiners would be decreased?)

(2) Then, is it normal that when performing my forward analysis, when deciding on which level to count my forward citations (appl_id, pat_publn_id or inpadoc_family_id / docdb_family_id) I always have the same count, which means that all forward citations are included as different “inventions” (because count(appl_id) = count(pat_publn_id) = count(inpadoc_family_id) = count(docdb_family_id) ? I mean, I thought that when counting the citing patents that cite my “original” (then my lithium-ion dataset) pool of patents, I expected that some of them would belong to the same family and then the count per inpadoc_family should not be the same than the count per application or publication?

(3) Finally, when using the datediff expression to limit my citation window time to 5 years, I sometimes get negative value. I do understand the mathematical reason behind this (we just required the time difference between the publication date of the cited and citing document to be equal or smaller than 5, which allows for negative values). What I don’t understand is the meaning of such a value. It would indeed mean that the cited document was actually published after the citing document … which makes no sense. Then my question is … what is the trick with the publication dates? I am using the wrong one?

Thank you very much for your time.

Here is joined, if useful for you, my MS query, using the same 'lithium-ion dataset' of original patents downloaded from PATSTAT:

Code: Select all


SELECT cited_application.docdb_family_id, Count(citing_application.docdb_family_id) AS count_docdb
FROM ((((tls200_appln_origin AS citing_origin INNER JOIN tls201_appln AS citing_application ON citing_origin.appln_id = citing_application.appln_id) INNER JOIN tls211_pat_publn AS citing_publn ON citing_application.appln_id = citing_publn.appln_id) INNER JOIN tls212_citation ON citing_publn.pat_publn_id = tls212_citation.pat_publn_id) INNER JOIN (tls211_pat_publn AS cited_publn INNER JOIN tls201_appln AS cited_application ON cited_publn.appln_id = cited_application.appln_id) ON tls212_citation.cited_pat_publn_id = cited_publn.pat_publn_id) INNER JOIN tls200_appln_origin AS cited_origin ON cited_application.appln_id = cited_origin.appln_id
WHERE (((cited_origin.origin)="original") AND ((tls212_citation.citn_origin)="SEA" Or (tls212_citation.citn_origin)="APP") AND ((DateDiff("yyyy",[cited_publn].[publn_date],[citing_publn].[publn_date]))<=5) AND ((cited_publn.publn_date)<>#12/31/9999#))
GROUP BY cited_application.docdb_family_id
ORDER BY Count(citing_application.docdb_family_id) DESC;




Post Reply