Retrieving tables of random sample

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

jon.charterina@ehu.eus
Posts: 4
Joined: Tue Mar 12, 2019 9:00 am

Retrieving tables of random sample

Post by jon.charterina@ehu.eus » Wed May 22, 2019 3:10 pm

Dear Sirs,
I need to create a random sample of patens from PATSTAT. Browsing in the net, I have found a command form Microsoft SQL: TABLESAMPLE (xx ROWS)
However, it seems it does not work in the editor. Being an unexperienced user of SQL I wonder if I am using it correctly.
My query is quite patchy and a bit long although it seems to be working. I've used it to retrieve patent data from a number of industries.

Here is my query:


SELECT
DISTINCT t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims,

MIN (t1.appln_filing_date) ApplnFilingDate,
MIN (t1.earliest_filing_date) AS EarliestFilingDate ,
MIN (t1.earliest_publn_date) AS EarliestPublnDate,
MAX (t1.docdb_family_size) AS DocdbFamilySize,
MAX (t1.inpadoc_family_id) AS InpadocFamilyID,
MAX (t1.nb_citing_docdb_fam) AS NbDOCDB,
MAX ( t1.nb_applicants ) AS NbApplicants,
MAX ( t1.nb_inventors) AS NbInventors ,
MAX ( t3.prior_appln_seq_nr) AS PlaceInListPriorities ,
COUNT (DISTINCT t5.psn_id) AS PsnID,

COUNT(CASE WHEN t5.psn_sector='INDIVIDUAL' THEN '1' END) as PsnSectorIND,
COUNT (CASE WHEN t5.psn_sector = 'COMPANY' THEN '1' end) as PsnSectorCOM,
COUNT(CASE WHEN t5.psn_sector = 'UNKNOWN' THEN '1' END) as PsnSectorUNK,
COUNT(CASE WHEN t5.psn_sector = 'GOV' THEN '1' END) as PsnSectorGOV,
COUNT(CASE WHEN t5.psn_sector = 'NON-PROFIT' THEN '1' END) as PsnSectorNON,
COUNT(CASE WHEN t5.psn_sector = 'UNIVERSITY' THEN '1' END) as PsnSectorUNI,
COUNT(CASE WHEN t5.psn_sector = 'HOSPITAL' THEN '1' END) as PsnSectorHOS,
COUNT(CASE WHEN t5.psn_sector = NULL THEN '1' END ) AS PsnSectorNULL,
COUNT (DISTINCT t8.citn_id) AS CitationsInCitingDoc,
COUNT (DISTINCT t8.cited_pat_publn_id) AS CitedPatPublnId,
COUNT (DISTINCT t8.cited_npl_publn_id) AS CitedArticles,
COUNT (DISTINCT t8.cited_appln_id) AS CitedApplnId,
MAX (t8.pat_citn_seq_nr) AS PatCitations,
MAX (t8.npl_citn_seq_nr) AS NonPatCitations,
COUNT (DISTINCT t8.citn_origin) AS CitnOrigin ,
COUNT (t8.citn_origin) AS CitnOrig1,



count (case when t8.citn_origin = 'APP' then '1' end) as CitnAPP,
count(case when t8.citn_origin = 'SEA' then 1 end ) as CitnSEA,
count(case when t8.citn_origin = 'ISR' then 1 end) as CitnISR,
count(case when t8.citn_origin = 'SUP' then 1 end) as CitnSUP,
count(case when t8.citn_origin = 'PRS' then 1 end) as CitnPRS,
count(case when t8.citn_origin = 'EXA' then 1 end) as CitnEXA,
count(case when t8.citn_origin = 'OPP' then 1 end) as CitnOPP,
count(case when t8.citn_origin = 'APL' then 1 end) as CitnAPL,
count(case when t8.citn_origin = 'FOP' then 1 end) as CitnFOP,
count(case when t8.citn_origin = 'TPO' then 1 end) as CitnTPO,
count(case when t8.citn_origin = 'CH2' then 1 end) as CitnCH2,


MAX ( t9.event_seq_nr) AS EventSeqLast,
COUNT (DISTINCT t9.event_seq_nr) AS EventSeqNr,
COUNT (DISTINCT t9.event_type ) AS EventType,
COUNT (DISTINCT t9.event_auth ) AS EventAuthDiversity,
COUNT (DISTINCT t9.event_code ) AS EventCodeDiversity ,

MAX (t9.spc_filing_date ) AS SPCFilingDate,
MAX (t9.spc_patent_expiry_date) AS SPCPatentExpDate,
MAX (t9.spc_extension_date) AS SPCExtensionDate,
COUNT (DISTINCT t9.fee_payment_date ) AS FeePaymentDates,
MAX (t9.fee_payment_date ) AS LastFeePaymentDate,
MIN (t9.fee_payment_date) AS FirstFeePaymentDate,
COUNT (DISTINCT t9.fee_renewal_year ) AS FeeRenewalYr,
MAX (t9.fee_renewal_year ) AS LastFeeRenewalYr

FROM
tls201_appln t1
INNER JOIN tls202_appln_title t2 ON t2.appln_id = t1.appln_id
INNER JOIN tls204_appln_prior t3 ON t3.appln_id = t1.appln_id
INNER JOIN tls207_pers_appln t4 ON t4.appln_id = t1.appln_id
INNER JOIN tls206_person t5 ON t5.person_id = t4.person_id
INNER JOIN tls209_appln_ipc t6 ON t6.appln_id = t1.appln_id
INNER JOIN tls211_pat_publn t7 ON t7.appln_id = t1.appln_id
INNER JOIN tls212_citation t8 ON t8.pat_publn_id = t7.pat_publn_id
INNER JOIN tls231_inpadoc_legal_event t9 ON t9.appln_id = t1.appln_id

WHERE( t6.ipc_class_symbol LIKE 'C22C 3%'
or t6.ipc_class_symbol LIKE 'C22C 33%'
or t6.ipc_class_symbol LIKE 'C22C 35%'
or t6.ipc_class_symbol LIKE 'C22C 47%'
or t6.ipc_class_symbol LIKE 'C22F %'
or t6.ipc_class_symbol LIKE 'C23C 14/56'
or t6.ipc_class_symbol LIKE 'C23C 16/54'
or t6.ipc_class_symbol LIKE 'C25B 9%'
or t6.ipc_class_symbol LIKE 'C25B 15/02'
or t6.ipc_class_symbol LIKE 'C25C %'
or t6.ipc_class_symbol LIKE 'C25D 1%'
or t6.ipc_class_symbol LIKE 'C30B 15/20'
or t6.ipc_class_symbol LIKE 'C30B 35%'
or t6.ipc_class_symbol LIKE 'C40B 60%'
or t6.ipc_class_symbol LIKE 'D01D 10%'
or t6.ipc_class_symbol LIKE 'D01D 11%'
or t6.ipc_class_symbol LIKE 'D01D 13%'
or t6.ipc_class_symbol LIKE 'D01F 9/133'
or t6.ipc_class_symbol LIKE 'D01F 9/32'
or t6.ipc_class_symbol LIKE 'D06B 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/20'
or t6.ipc_class_symbol LIKE 'D21H 23/70'
) --ADVANCED MANUFACTURING

GROUP BY t1.appln_id , t1.granted, t1.appln_nr, t1.appln_auth, t1.appln_kind, t7.publn_claims


What I need for my study is a control random sample not centered in any sector. I thought it would work turning off the WHERE clause with /* and */, and adding <FROM tls201_appln t1 TABLESAMPLE(1000 ROWS)> just before the first JOIN. But it does not.

What should I do to get a random sample?
Thank you for your attention!
Regards,

Jon Charterina, Ph.D.


EPO / PATSTAT Support
Posts: 229
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Retrieving tables of random sample

Post by EPO / PATSTAT Support » Thu May 23, 2019 9:30 am

Hi Jon,

TABLESAMPLE is not supported by the driver we are using in PATSTAT Online. Nevertheless, there are many other ways to get a random sample.

The technical identifier APPLN_ID is a sequential number assigned to the applications as they are loaded into the database. [[So it can be regarded as random.]] (Update / correction: Although you cannot infer any business information from the APPLN_ID, it generally is the case that lower APPLN_ID values refer to older applications, because these have been loaded at an earlier point in time. Also, because the data is loaded weekly in batches grouped by office (APPLN_AUTH), usually consecutive APPLN_IDs refer to the same office. However, if you do not take a block of consecutive APPLN_IDs but take a random sample over the complete range of APPLN_IDs, you should get a random sample of applications)

To take a sample set of applications you may just use the modulo function of SQL, like;

Code: Select all

SELECT * FROM tls201_appln 
where (appln_id % 10000) = 0  -- or any other number between 0 and 9999
-- Note: corrected on 29.05.2019: TOP clause removed
The advantage is that it will return the same result every time you execute the query.

Other more sophisticated ways are described in this article: https://docs.microsoft.com/en-us/previo ... v=msdn.10)

As a side note:
Due to your (INNER) JOINs you actually retrieve only applications we do have a priority, which delivered legal events, where the title is known, etc . So you do have a bias towards countries and time periods where more data is available by implicitly excluding some older applications or some (smaller) countries or applications without a priority. Depending on your work this might be OK, but you just need to be aware.

I hoped this helped,
Martin
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


jon.charterina@ehu.eus
Posts: 4
Joined: Tue Mar 12, 2019 9:00 am

Re: Retrieving tables of random sample

Post by jon.charterina@ehu.eus » Tue May 28, 2019 6:45 pm

Dear Martin,
Thank you very much for your advice. I had seen the notes you sent me before in the web. I am not an expert as to adapt some of these sentences, in the right way. However, if I am not wrong, I think SELECT TOP * IS not a solution if you want to extract a table that is identical in terms of the fields initially designed in a query such as that of mine. That is, only if you decide to take every and only all the fields from tls201_appln will it work. The problem is that I need summary fields from other tables. Hence all those JOINs among tables.
As for the INNER JOINs, by the way, thank you for your warning message. However, this is the population that I wnat to study specifically.
My problem is that being this the population, I need a contrast table (it could be a sample) made exactly the same, containing elements from any ipc_class_symbol (from tls209_appln_ipc table).

In short, I have decided to:
1st. Extract tables with only the publication year and appln_id everything else equal and putting WHERE tls209_appln_ipc.ipc_class_symbol IS '%' for batches of years or groups of years (making the tables have less than 999.999 rows each). Extract the entire population.
2nd. Create a sample table containing only appln_id numbers using Excel from all the joined tables downloaded in (1)
3rd. Put this subset of codes in the WHERE part of a new search, using WHERE like here:
WHERE (t1.appln_id = '4781' or
t1.appln_id = '4782' or
t1.appln_id = '4783' or
t1.appln_id = '4784' or
t1.appln_id = '4785' or
t1.appln_id = '4786' or
t1.appln_id = '4788'
...

This search would be exactly the same in the SELECT
and FROM parts.
4th. Repeat this as many times as necessary up to complete a sample.

I have the feeling that this is a very clumsy way of working, but some time in the future I hope I will have it done.

Thanks again.

Jon


Post Reply