Retrieving tables of random sample
Posted: Wed May 22, 2019 3:10 pm
Dear Sirs,
I need to create a random sample of patens from PATSTAT. Browsing in the net, I have found a command form Microsoft SQL: TABLESAMPLE (xx ROWS)
However, it seems it does not work in the editor. Being an unexperienced user of SQL I wonder if I am using it correctly.
My query is quite patchy and a bit long although it seems to be working. I've used it to retrieve patent data from a number of industries.
Here is my query:
SELECT
DISTINCT t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims,
MIN (t1.appln_filing_date) ApplnFilingDate,
MIN (t1.earliest_filing_date) AS EarliestFilingDate ,
MIN (t1.earliest_publn_date) AS EarliestPublnDate,
MAX (t1.docdb_family_size) AS DocdbFamilySize,
MAX (t1.inpadoc_family_id) AS InpadocFamilyID,
MAX (t1.nb_citing_docdb_fam) AS NbDOCDB,
MAX ( t1.nb_applicants ) AS NbApplicants,
MAX ( t1.nb_inventors) AS NbInventors ,
MAX ( t3.prior_appln_seq_nr) AS PlaceInListPriorities ,
COUNT (DISTINCT t5.psn_id) AS PsnID,
COUNT(CASE WHEN t5.psn_sector='INDIVIDUAL' THEN '1' END) as PsnSectorIND,
COUNT (CASE WHEN t5.psn_sector = 'COMPANY' THEN '1' end) as PsnSectorCOM,
COUNT(CASE WHEN t5.psn_sector = 'UNKNOWN' THEN '1' END) as PsnSectorUNK,
COUNT(CASE WHEN t5.psn_sector = 'GOV' THEN '1' END) as PsnSectorGOV,
COUNT(CASE WHEN t5.psn_sector = 'NON-PROFIT' THEN '1' END) as PsnSectorNON,
COUNT(CASE WHEN t5.psn_sector = 'UNIVERSITY' THEN '1' END) as PsnSectorUNI,
COUNT(CASE WHEN t5.psn_sector = 'HOSPITAL' THEN '1' END) as PsnSectorHOS,
COUNT(CASE WHEN t5.psn_sector = NULL THEN '1' END ) AS PsnSectorNULL,
COUNT (DISTINCT t8.citn_id) AS CitationsInCitingDoc,
COUNT (DISTINCT t8.cited_pat_publn_id) AS CitedPatPublnId,
COUNT (DISTINCT t8.cited_npl_publn_id) AS CitedArticles,
COUNT (DISTINCT t8.cited_appln_id) AS CitedApplnId,
MAX (t8.pat_citn_seq_nr) AS PatCitations,
MAX (t8.npl_citn_seq_nr) AS NonPatCitations,
COUNT (DISTINCT t8.citn_origin) AS CitnOrigin ,
COUNT (t8.citn_origin) AS CitnOrig1,
count (case when t8.citn_origin = 'APP' then '1' end) as CitnAPP,
count(case when t8.citn_origin = 'SEA' then 1 end ) as CitnSEA,
count(case when t8.citn_origin = 'ISR' then 1 end) as CitnISR,
count(case when t8.citn_origin = 'SUP' then 1 end) as CitnSUP,
count(case when t8.citn_origin = 'PRS' then 1 end) as CitnPRS,
count(case when t8.citn_origin = 'EXA' then 1 end) as CitnEXA,
count(case when t8.citn_origin = 'OPP' then 1 end) as CitnOPP,
count(case when t8.citn_origin = 'APL' then 1 end) as CitnAPL,
count(case when t8.citn_origin = 'FOP' then 1 end) as CitnFOP,
count(case when t8.citn_origin = 'TPO' then 1 end) as CitnTPO,
count(case when t8.citn_origin = 'CH2' then 1 end) as CitnCH2,
MAX ( t9.event_seq_nr) AS EventSeqLast,
COUNT (DISTINCT t9.event_seq_nr) AS EventSeqNr,
COUNT (DISTINCT t9.event_type ) AS EventType,
COUNT (DISTINCT t9.event_auth ) AS EventAuthDiversity,
COUNT (DISTINCT t9.event_code ) AS EventCodeDiversity ,
MAX (t9.spc_filing_date ) AS SPCFilingDate,
MAX (t9.spc_patent_expiry_date) AS SPCPatentExpDate,
MAX (t9.spc_extension_date) AS SPCExtensionDate,
COUNT (DISTINCT t9.fee_payment_date ) AS FeePaymentDates,
MAX (t9.fee_payment_date ) AS LastFeePaymentDate,
MIN (t9.fee_payment_date) AS FirstFeePaymentDate,
COUNT (DISTINCT t9.fee_renewal_year ) AS FeeRenewalYr,
MAX (t9.fee_renewal_year ) AS LastFeeRenewalYr
FROM
tls201_appln t1
INNER JOIN tls202_appln_title t2 ON t2.appln_id = t1.appln_id
INNER JOIN tls204_appln_prior t3 ON t3.appln_id = t1.appln_id
INNER JOIN tls207_pers_appln t4 ON t4.appln_id = t1.appln_id
INNER JOIN tls206_person t5 ON t5.person_id = t4.person_id
INNER JOIN tls209_appln_ipc t6 ON t6.appln_id = t1.appln_id
INNER JOIN tls211_pat_publn t7 ON t7.appln_id = t1.appln_id
INNER JOIN tls212_citation t8 ON t8.pat_publn_id = t7.pat_publn_id
INNER JOIN tls231_inpadoc_legal_event t9 ON t9.appln_id = t1.appln_id
WHERE( t6.ipc_class_symbol LIKE 'C22C 3%'
or t6.ipc_class_symbol LIKE 'C22C 33%'
or t6.ipc_class_symbol LIKE 'C22C 35%'
or t6.ipc_class_symbol LIKE 'C22C 47%'
or t6.ipc_class_symbol LIKE 'C22F %'
or t6.ipc_class_symbol LIKE 'C23C 14/56'
or t6.ipc_class_symbol LIKE 'C23C 16/54'
or t6.ipc_class_symbol LIKE 'C25B 9%'
or t6.ipc_class_symbol LIKE 'C25B 15/02'
or t6.ipc_class_symbol LIKE 'C25C %'
or t6.ipc_class_symbol LIKE 'C25D 1%'
or t6.ipc_class_symbol LIKE 'C30B 15/20'
or t6.ipc_class_symbol LIKE 'C30B 35%'
or t6.ipc_class_symbol LIKE 'C40B 60%'
or t6.ipc_class_symbol LIKE 'D01D 10%'
or t6.ipc_class_symbol LIKE 'D01D 11%'
or t6.ipc_class_symbol LIKE 'D01D 13%'
or t6.ipc_class_symbol LIKE 'D01F 9/133'
or t6.ipc_class_symbol LIKE 'D01F 9/32'
or t6.ipc_class_symbol LIKE 'D06B 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/70'
) --ADVANCED MANUFACTURING
GROUP BY t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims
What I need for my study is a control random sample not centered in any sector. I thought it would work turning off the WHERE clause with /* and */, and adding <FROM tls201_appln t1 TABLESAMPLE(1000 ROWS)> just before the first JOIN. But it does not.
What should I do to get a random sample?
Thank you for your attention!
Regards,
Jon Charterina, Ph.D.
I need to create a random sample of patens from PATSTAT. Browsing in the net, I have found a command form Microsoft SQL: TABLESAMPLE (xx ROWS)
However, it seems it does not work in the editor. Being an unexperienced user of SQL I wonder if I am using it correctly.
My query is quite patchy and a bit long although it seems to be working. I've used it to retrieve patent data from a number of industries.
Here is my query:
SELECT
DISTINCT t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims,
MIN (t1.appln_filing_date) ApplnFilingDate,
MIN (t1.earliest_filing_date) AS EarliestFilingDate ,
MIN (t1.earliest_publn_date) AS EarliestPublnDate,
MAX (t1.docdb_family_size) AS DocdbFamilySize,
MAX (t1.inpadoc_family_id) AS InpadocFamilyID,
MAX (t1.nb_citing_docdb_fam) AS NbDOCDB,
MAX ( t1.nb_applicants ) AS NbApplicants,
MAX ( t1.nb_inventors) AS NbInventors ,
MAX ( t3.prior_appln_seq_nr) AS PlaceInListPriorities ,
COUNT (DISTINCT t5.psn_id) AS PsnID,
COUNT(CASE WHEN t5.psn_sector='INDIVIDUAL' THEN '1' END) as PsnSectorIND,
COUNT (CASE WHEN t5.psn_sector = 'COMPANY' THEN '1' end) as PsnSectorCOM,
COUNT(CASE WHEN t5.psn_sector = 'UNKNOWN' THEN '1' END) as PsnSectorUNK,
COUNT(CASE WHEN t5.psn_sector = 'GOV' THEN '1' END) as PsnSectorGOV,
COUNT(CASE WHEN t5.psn_sector = 'NON-PROFIT' THEN '1' END) as PsnSectorNON,
COUNT(CASE WHEN t5.psn_sector = 'UNIVERSITY' THEN '1' END) as PsnSectorUNI,
COUNT(CASE WHEN t5.psn_sector = 'HOSPITAL' THEN '1' END) as PsnSectorHOS,
COUNT(CASE WHEN t5.psn_sector = NULL THEN '1' END ) AS PsnSectorNULL,
COUNT (DISTINCT t8.citn_id) AS CitationsInCitingDoc,
COUNT (DISTINCT t8.cited_pat_publn_id) AS CitedPatPublnId,
COUNT (DISTINCT t8.cited_npl_publn_id) AS CitedArticles,
COUNT (DISTINCT t8.cited_appln_id) AS CitedApplnId,
MAX (t8.pat_citn_seq_nr) AS PatCitations,
MAX (t8.npl_citn_seq_nr) AS NonPatCitations,
COUNT (DISTINCT t8.citn_origin) AS CitnOrigin ,
COUNT (t8.citn_origin) AS CitnOrig1,
count (case when t8.citn_origin = 'APP' then '1' end) as CitnAPP,
count(case when t8.citn_origin = 'SEA' then 1 end ) as CitnSEA,
count(case when t8.citn_origin = 'ISR' then 1 end) as CitnISR,
count(case when t8.citn_origin = 'SUP' then 1 end) as CitnSUP,
count(case when t8.citn_origin = 'PRS' then 1 end) as CitnPRS,
count(case when t8.citn_origin = 'EXA' then 1 end) as CitnEXA,
count(case when t8.citn_origin = 'OPP' then 1 end) as CitnOPP,
count(case when t8.citn_origin = 'APL' then 1 end) as CitnAPL,
count(case when t8.citn_origin = 'FOP' then 1 end) as CitnFOP,
count(case when t8.citn_origin = 'TPO' then 1 end) as CitnTPO,
count(case when t8.citn_origin = 'CH2' then 1 end) as CitnCH2,
MAX ( t9.event_seq_nr) AS EventSeqLast,
COUNT (DISTINCT t9.event_seq_nr) AS EventSeqNr,
COUNT (DISTINCT t9.event_type ) AS EventType,
COUNT (DISTINCT t9.event_auth ) AS EventAuthDiversity,
COUNT (DISTINCT t9.event_code ) AS EventCodeDiversity ,
MAX (t9.spc_filing_date ) AS SPCFilingDate,
MAX (t9.spc_patent_expiry_date) AS SPCPatentExpDate,
MAX (t9.spc_extension_date) AS SPCExtensionDate,
COUNT (DISTINCT t9.fee_payment_date ) AS FeePaymentDates,
MAX (t9.fee_payment_date ) AS LastFeePaymentDate,
MIN (t9.fee_payment_date) AS FirstFeePaymentDate,
COUNT (DISTINCT t9.fee_renewal_year ) AS FeeRenewalYr,
MAX (t9.fee_renewal_year ) AS LastFeeRenewalYr
FROM
tls201_appln t1
INNER JOIN tls202_appln_title t2 ON t2.appln_id = t1.appln_id
INNER JOIN tls204_appln_prior t3 ON t3.appln_id = t1.appln_id
INNER JOIN tls207_pers_appln t4 ON t4.appln_id = t1.appln_id
INNER JOIN tls206_person t5 ON t5.person_id = t4.person_id
INNER JOIN tls209_appln_ipc t6 ON t6.appln_id = t1.appln_id
INNER JOIN tls211_pat_publn t7 ON t7.appln_id = t1.appln_id
INNER JOIN tls212_citation t8 ON t8.pat_publn_id = t7.pat_publn_id
INNER JOIN tls231_inpadoc_legal_event t9 ON t9.appln_id = t1.appln_id
WHERE( t6.ipc_class_symbol LIKE 'C22C 3%'
or t6.ipc_class_symbol LIKE 'C22C 33%'
or t6.ipc_class_symbol LIKE 'C22C 35%'
or t6.ipc_class_symbol LIKE 'C22C 47%'
or t6.ipc_class_symbol LIKE 'C22F %'
or t6.ipc_class_symbol LIKE 'C23C 14/56'
or t6.ipc_class_symbol LIKE 'C23C 16/54'
or t6.ipc_class_symbol LIKE 'C25B 9%'
or t6.ipc_class_symbol LIKE 'C25B 15/02'
or t6.ipc_class_symbol LIKE 'C25C %'
or t6.ipc_class_symbol LIKE 'C25D 1%'
or t6.ipc_class_symbol LIKE 'C30B 15/20'
or t6.ipc_class_symbol LIKE 'C30B 35%'
or t6.ipc_class_symbol LIKE 'C40B 60%'
or t6.ipc_class_symbol LIKE 'D01D 10%'
or t6.ipc_class_symbol LIKE 'D01D 11%'
or t6.ipc_class_symbol LIKE 'D01D 13%'
or t6.ipc_class_symbol LIKE 'D01F 9/133'
or t6.ipc_class_symbol LIKE 'D01F 9/32'
or t6.ipc_class_symbol LIKE 'D06B 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/70'
) --ADVANCED MANUFACTURING
GROUP BY t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims
What I need for my study is a control random sample not centered in any sector. I thought it would work turning off the WHERE clause with /* and */, and adding <FROM tls201_appln t1 TABLESAMPLE(1000 ROWS)> just before the first JOIN. But it does not.
What should I do to get a random sample?
Thank you for your attention!
Regards,
Jon Charterina, Ph.D.