19 Replies. Latest reply: Jan 4, 2010 12:36 PM by 807580

How can I identify a base domain name from a given URL String

807580 Newbie
I want to compare one URL String to another in order to determine whether or not one domain is local to the other. The problem starts when both domain names are sub-domains.
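
As a rough sketch of the first step (getting from a URL String to a host name that can be compared), assuming `java.net.URI` is acceptable for parsing; the class and method names here are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class HostExtractor {
    /** Returns the lower-cased host of a URL string, or null if no host can be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            // Host names are case-insensitive, so normalise before comparing.
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://a.example.com/index.html")); // a.example.com
        System.out.println(hostOf("http://B.Example.com:8080/"));      // b.example.com
    }
}
```

Note that a String without a scheme (for example a bare "a.example.com") is parsed as a path, so this sketch returns null for it.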

If I have:

a.example.com & b.example.com

Both are local to example.com. However, I can see no sensible way to determine this programmatically without first compiling a complete list of all domain extensions.

A fool's approach would be to tokenize the Strings with a "." delimiter and compare the second from last tokens, but this won't always work as there are domain extensions like .co.uk.
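
To make the failure concrete, here is a minimal sketch of a variant of that naive approach which keeps the last two "." delimited tokens. The class and method names are hypothetical:

```java
final class NaiveBaseDomain {
    /** Naive rule: keep only the last two dot-separated labels of a host name. */
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length < 2) {
            return host; // nothing to strip
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("a.example.com")); // example.com (what we want)
        System.out.println(lastTwoLabels("www.bbc.co.uk")); // co.uk (not a usable base domain)
    }
}
```

With this rule every site under .co.uk would compare as local to every other one.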

The only approach I can think of is to compile a full list of domain extensions and scan the Strings for, and remove, the extension. Then compare the right most token delimited by ".".
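
A minimal sketch of that list-based idea, assuming the extensions are held in a Set. The six entries below are only an illustrative sample rather than a complete list, and the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SuffixListBaseDomain {
    // Illustrative sample only; a real list would contain thousands of entries.
    private static final Set<String> SUFFIXES = new HashSet<>(
            Arrays.asList("com", "uk", "co.uk", "at", "ac.at", "gv.at"));

    /** Returns the longest known extension plus one more label, or null if none applies. */
    static String baseDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try candidate extensions from longest to shortest.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                if (i == 0) {
                    return null; // the host is itself an extension, e.g. "co.uk"
                }
                return labels[i - 1] + "." + candidate;
            }
        }
        return null; // extension not in the list
    }

    /** True when both hosts resolve to the same, known base domain. */
    static boolean sameBaseDomain(String a, String b) {
        String x = baseDomain(a);
        String y = baseDomain(b);
        return x != null && x.equals(y);
    }

    public static void main(String[] args) {
        System.out.println(baseDomain("a.example.com"));                      // example.com
        System.out.println(baseDomain("www.bbc.co.uk"));                      // bbc.co.uk
        System.out.println(sameBaseDomain("a.example.com", "b.example.com")); // true
    }
}
```

Checking the longest candidate first is what lets "co.uk" win over "uk" when both are in the list.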

This would be a bit of a headache. Can anyone think of a better way of doing this?

Cheers,

Finbarr
  • 1. Re: How can I identify a base domain name from a given URL String
    JoachimSauer Journeyer
    FTAYLOR wrote:
    a.example.com & b.example.com

    Both are local to example.com
    According to which definition?

    That's your problem right there. Whichever spec says that they are "local to example.com" needs to have rules for how to get to that conclusion. Implementing those rules in code will be the easier part.
  • 2. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    Ok, you are looking at forums.sun.com right now. The base domain name is sun.com. You could easily write a regular expression to determine if a domain name is valid or not, and this would require the set of rules you are talking about.

    However, for my purposes, I do not think a rule set will be of any assistance as there are too many variables and no constants. I think I am going to need a full list of domain extensions.
  • 3. Re: How can I identify a base domain name from a given URL String
    JoachimSauer Journeyer
    FTAYLOR wrote:
    Ok, you are looking at forums.sun.com right now.
    Right
    The base domain name is sun.com.
    Says who? Why isn't it "com"? You use the term "base domain name" as if it had a well-defined meaning, when in fact it hasn't got one. So you need to either 1.) point us to some definition of the term that you are using or 2.) define the term yourself.
    You could easily write a regular expression to determine if a domain name is valid or not, and this would require the set of rules you are talking about.
    I get the basic idea you are talking about. But that still doesn't define anything that can be put in code.
    However, for my purposes
    What purpose is that? Since you didn't tell us what you want to achieve (or why), we can't tell you how to proceed.
    , I do not think a rule set will be of any assistance as there are too many variables and no constants. I think I am going to need a full list of domain extensions.
    Then get that.

    Edit: this might be helpful. Check the external link on that page.
  • 4. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    Granted I was not very clear in my definition.

    What I meant by a "base domain name" was the Second Level Domain as defined by the Domain Name System, but devoid of any further sub domains. In my earlier example forums.sun.com, com is the Top Level Domain, sun is the Second Level Domain and forums is the Third Level Domain (of N Level Domains). As some Top Level Domains (normally country code Top Level Domains) often define their own Second Level Domain, sometimes it will be necessary for me to extract the Third Level Domain (again, devoid of any further N Level Domains). This task will be nigh on impossible without a list of all TLDs to work out when to use the Third Level Domain instead of the SLD.

    Looks like the Mozilla List is the way to proceed.
  • 5. Re: How can I identify a base domain name from a given URL String
    JoachimSauer Journeyer
    FTAYLOR wrote:
    Granted I was not very clear in my definition.

    What I meant by a "base domain name" was the Second Level Domain as defined by the Domain Name System, but devoid of any further sub domains.
    No, that's not what you want, because according to this definition you expect the input "bbc.co.uk" to produce "co.uk".
    In my earlier example forums.sun.com, com is the Top Level Domain, sun is the Second Level Domain and forums is the Third Level Domain (of N Level Domains). As some Top Level Domains (normally country code Top Level Domains) often define their own Second Level Domain, sometimes it will be necessary for me to extract the Third Level Domain (again, devoid of any further N Level Domains).
    There's some nastiness even there: Austria has .at and there are tons of "foo.at" sites. But there's also ".ac.at" for universities and ".gv.at" for government sites. So there you'd need the precise list of ccSLDs to finish your task, not just the information that ".at" has "special" SLDs.
    This task will be nigh on impossible without a list of all TLDs to work out when to use the Third Level Domain instead of the SLD.
    It is impossible without such a list.
  • 6. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    I can't think of a case, off the top of my head, where a two letter final component isn't a country code, or where a three letter final component is.
  • 7. Re: How can I identify a base domain name from a given URL String
    791266 Explorer
    malcolmmc wrote:
    I can't think of a case, off the top of my head, where a two letter final component isn't a country code
    eu isn't a country.
  • 8. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    kajbj wrote:
    malcolmmc wrote:
    I can't think of a case, off the top of my head, where a two letter final component isn't a country code
    eu isn't a country.
    It wants to be, though.

    Any more counter-examples to special-case?
  • 9. Re: How can I identify a base domain name from a given URL String
    791266 Explorer
    malcolmmc wrote:
    kajbj wrote:
    malcolmmc wrote:
    I can't think of a case, off the top of my head, where a two letter final component isn't a country code
    eu isn't a country.
    It wants to be, though.
    Over my dead body :p
    Any more counter-examples to special-case?
    There seem to be more special cases: .ax (the top-level domain for Åland, which belongs to Finland) isn't a country either.

    ...but why is it important to know whether a two-letter component represents a country or not?
  • 10. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    Scotland is due to apply for the .scot ccTLD and Wales for the .cym ccTLD, so even if there are not currently three letter ccTLDs there soon will be.

    @Joachim please check definition caveats in further sentences from the same paragraph prior to jumping to conclusions, and thanks for your help.
  • 11. Re: How can I identify a base domain name from a given URL String
    791266 Explorer
    FTAYLOR wrote:
    Scotland is due to apply for the .scot ccTLD and Wales for the .cym ccTLD, so even if there are not currently three letter ccTLDs there soon will be.

    @Joachim please check definition caveats in further sentences from the same paragraph prior to jumping to conclusions, and thanks for your help.
    Note that the .ax I mentioned earlier can become a huge problem for you. ".ax" used to be "aland.fi", so e.g. peace.aland.fi is the same as peace.ax.
  • 12. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    I am not going to bother compiling a full list. Identifying whether or not two domain names are equal to one another is only a very small part of a much larger machine learning process.

    A-O

    ac com.ac edu.ac gov.ac net.ac mil.ac org.ac ad nom.ad ae net.ae gov.ae org.ae mil.ae sch.ae ac.ae pro.ae name.ae aero af gov.af edu.af net.af com.af ag com.ag org.ag net.ag co.ag nom.ag ai off.ai com.ai net.ai org.ai gov.al edu.al org.al com.al net.al uniti.al tirana.al soros.al upt.al inima.al am an com.an net.an org.an edu.an ao co.ao ed.ao gv.ao it.ao og.ao pb.ao ar com.ar gov.ar int.ar mil.ar net.ar org.ar arpa e164.arpa in-addr.arpa iris.arpa ip6.arpa uri.arpa urn.arpa as asia at at gv.at ac.at co.at or.at priv.at au asn.au com.au net.au id.au org.au csiro.au oz.au info.au conf.au act.au nsw.au nt.au qld.au sa.au tas.au vic.au wa.au gov.au edu.au aw com.aw ax az com.az net.az int.az gov.az biz.az org.az edu.az mil.az pp.az name.az info.az ba bb com.bb edu.bb gov.bb net.bb org.bb bd com.bd edu.bd net.bd gov.bd org.bd mil.bd be ac.be to.be com.be co.be xa.be ap.be bf gov.bf bg bh bi biz bj bm com.bm edu.bm org.bm gov.bm net.bm bn com.bn edu.bn org.bn net.bn bo com.bo org.bo net.bo gov.bo gob.bo edu.bo tv.bo mil.bo int.bo br agr.br am.br art.br edu.br com.br coop.br esp.br far.br fm.br g12.br gov.br imb.br ind.br inf.br mil.br net.br org.br psi.br rec.br srv.br tmp.br tur.br tv.br etc.br adm.br adv.br arq.br ato.br bio.br bmd.br cim.br cng.br cnt.br ecn.br eng.br eti.br fnd.br fot.br fst.br ggf.br jor.br lel.br mat.br med.br mus.br not.br ntr.br odo.br ppg.br pro.br psc.br qsl.br slg.br trd.br vet.br zlg.br dpn.br nom.br bs com.bs net.bs org.bs bt com.bt edu.bt gov.bt net.bt org.bt bw co.bw org.bw by gov.by mil.by bz ca ab.ca bc.ca mb.ca nb.ca nf.ca nl.ca ns.ca nt.ca nu.ca on.ca pe.ca qc.ca sk.ca yk.ca cat cc co.cc cd com.cd net.cd org.cd cf cg ch com.ch net.ch org.ch gov.ch ci ck co.ck cl cm cn ac.cn com.cn edu.cn gov.cn net.cn org.cn ah.cn bj.cn cq.cn fj.cn gd.cn gs.cn gz.cn gx.cn ha.cn hb.cn he.cn hi.cn hl.cn hn.cn jl.cn js.cn jx.cn ln.cn nm.cn nx.cn qh.cn sc.cn sd.cn sh.cn sn.cn sx.cn tj.cn xj.cn xz.cn yn.cn zj.cn co com.co edu.co org.co gov.co mil.co 
net.co nom.co com coop cr ac.cr co.cr ed.cr fi.cr go.cr or.cr sa.cr cu com.cu edu.cu org.cu net.cu gov.cu inf.cu cv cx gov.cx com.cy biz.cy info.cy ltd.cy pro.cy net.cy org.cy name.cy tm.cy ac.cy ekloges.cy press.cy parliament.cy cz de dj dk dm com.dm net.dm org.dm edu.dm gov.dm do edu.do gov.do gob.do com.do org.do sld.do web.do net.do mil.do art.do dz com.dz org.dz net.dz gov.dz edu.dz asso.dz pol.dz art.dz ec com.ec info.ec net.ec fin.ec med.ec pro.ec org.ec edu.ec gov.ec mil.ec edu ee com.ee org.ee fie.ee pri.ee eg eun.eg edu.eg sci.eg gov.eg com.eg org.eg net.eg mil.eg er es com.es nom.es org.es gob.es edu.es et com.et gov.et org.et edu.et net.et biz.et name.et info.et eu fi aland.fi fj biz.fj com.fj info.fj name.fj net.fj org.fj pro.fj ac.fj gov.fj mil.fj school.fj fk co.fk org.fk gov.fk ac.fk nom.fk net.fk fm fo fr tm.fr asso.fr nom.fr prd.fr presse.fr com.fr gouv.fr ga gd ge com.ge edu.ge gov.ge org.ge mil.ge net.ge pvt.ge gf gg co.gg net.gg org.gg gh com.gh edu.gh gov.gh org.gh mil.gh gi com.gi ltd.gi gov.gi mod.gi edu.gi org.gi gl gm gn com.gn ac.gn gov.gn org.gn net.gn gov gp com.gp net.gp edu.gp asso.gp or org.gp gq gr com.gr edu.gr net.gr org.gr gov.gr gs gt gu gw gy hk com.hk edu.hk gov.hk idv.hk net.hk org.hk hm hn com.hn edu.hn org.hn net.hn mil.hn gob.hn hr iz.hr from.hr name.hr com.hr ht com.ht net.ht firm.ht shop.ht info.ht pro.ht adult.ht org.ht art.ht pol.ht rel.ht asso.ht perso.ht coop.ht med.ht edu.ht gouv.ht hu co.hu info.hu org.hu priv.hu sport.hu tm.hu 2000.hu agrar.hu bolt.hu casino.hu city.hu erotica.hu erotika.hu film.hu forum.hu games.hu hotel.hu ingatlan.hu jogasz.hu konyvelo.hu lakas.hu media.hu news.hu reklam.hu sex.hu shop.hu suli.hu szex.hu tozsde.hu utazas.hu video.hu id ac.id co.id or.id go.id ie gov.ie il ac.il co.il org.il net.il k12.il gov.il muni.il idf.il im co.im ltd.co.im plc.co.im net.im gov.im org.im nic.im ac.im in co.in firm.in net.in org.in gen.in ind.in nic.in ac.in edu.in res.in gov.in mil.in info int io iq ir 
ac.ir co.ir gov.ir net.ir org.ir sch.ir is it gov.it je co.je net.je org.je jm edu.jm gov.jm com.jm net.jm org.jm jo com.jo org.jo net.jo edu.jo gov.jo mil.jo jobs jp ac.jp ad.jp co.jp ed.jp go.jp gr.jp lg.jp ne.jp or.jp hokkaido.jp aomori.jp iwate.jp miyagi.jp akita.jp yamagata.jp fukushima.jp ibaraki.jp tochigi.jp gunma.jp saitama.jp chiba.jp tokyo.jp kanagawa.jp niigata.jp toyama.jp ishikawa.jp fukui.jp yamanashi.jp nagano.jp gifu.jp shizuoka.jp aichi.jp mie.jp shiga.jp kyoto.jp osaka.jp hyogo.jp nara.jp wakayama.jp tottori.jp shimane.jp okayama.jp hiroshima.jp yamaguchi.jp tokushima.jp kagawa.jp ehime.jp kochi.jp fukuoka.jp saga.jp nagasaki.jp kumamoto.jp oita.jp miyazaki.jp kagoshima.jp okinawa.jp sapporo.jp sendai.jp yokohama.jp kawasaki.jp nagoya.jp kobe.jp kitakyushu.jp ke kg kh per.kh com.kh edu.kh gov.kh mil.kh net.kh org.kh ki km kn kr co.kr or.kr kw com.kw edu.kw gov.kw net.kw org.kw mil.kw ky edu.ky gov.ky com.ky org.ky net.ky kz org.kz edu.kz net.kz gov.kz mil.kz com.kz la lb net.lb org.lb gov.lb edu.lb com.lb lc com.lc org.lc edu.lc gov.lc li com.li net.li org.li gov.li lk gov.lk sch.lk net.lk int.lk com.lk org.lk edu.lk ngo.lk soc.lk web.lk ltd.lk assn.lk grp.lk hotel.lk lr com.lr edu.lr gov.lr org.lr net.lr ls org.ls co.ls lt gov.lt mil.lt lu gov.lu mil.lu org.lu net.lu lv com.lv edu.lv gov.lv org.lv mil.lv id.lv net.lv asn.lv conf.lv ly com.ly net.ly gov.ly plc.ly edu.ly sch.ly med.ly org.ly id.ly ma co.ma net.ma gov.ma org.ma mc tm.mc asso.mc md me mg org.mg nom.mg gov.mg prd.mg tm.mg com.mg edu.mg mil.mg mh mil army.mil navy.mil mk com.mk org.mk ml mm mn mo com.mo net.mo org.mo edu.mo gov.mo mobi mp mq mr ms mt org.mt com.mt gov.mt edu.mt net.mt mu com.mu co.mu museum mv aero.mv biz.mv com.mv coop.mv edu.mv gov.mv info.mv int.mv mil.mv museum.mv name.mv net.mv org.mv pro.mv mw ac.mw co.mw com.mw coop.mw edu.mw gov.mw int.mw museum.mw net.mw org.mw mx com.mx net.mx org.mx edu.mx gob.mx my com.my net.my org.my gov.my edu.my mil.my name.my mz na 
name nc ne net nf ng edu.ng com.ng gov.ng org.ng net.ng ni gob.ni com.ni edu.ni org.ni nom.ni net.ni nl no mil.no stat.no kommune.no herad.no priv.no vgs.no fhs.no museum.no fylkesbibl.no folkebibl.no idrett.no np com.np org.np edu.np net.np gov.np mil.np nr gov.nr edu.nr biz.nr info.nr org.nr com.nr net.nr nu nz ac.nz co.nz cri.nz gen.nz geek.nz govt.nz iwi.nz maori.nz mil.nz net.nz org.nz school.nz om com.om co.om edu.om ac.om sch.om gov.om net.om org.om mil.om museum.om biz.om pro.om med.om org
  • 13. Re: How can I identify a base domain name from a given URL String
    807580 Newbie
    P-Z

    pa com.pa ac.pa sld.pa gob.pa edu.pa org.pa net.pa abo.pa ing.pa med.pa nom.pa pe com.pe org.pe net.pe edu.pe mil.pe gob.pe nom.pe pf com.pf org.pf edu.pf pg com.pg net.pg ph com.ph gov.ph pk com.pk net.pk edu.pk org.pk fam.pk biz.pk web.pk gov.pk gob.pk gok.pk gon.pk gop.pk gos.pk pl com.pl biz.pl net.pl art.pl edu.pl org.pl ngo.pl gov.pl info.pl mil.pl waw.pl warszawa.pl wroc.pl wroclaw.pl krakow.pl poznan.pl lodz.pl gda.pl gdansk.pl slupsk.pl szczecin.pl lublin.pl bialystok.pl olsztyn.pl.torun.pl pm pn pr biz.pr com.pr edu.pr gov.pr info.pr isla.pr name.pr net.pr org.pr pro.pr pro law.pro med.pro cpa.pro ps edu.ps gov.ps sec.ps plo.ps com.ps org.ps net.ps pt com.pt edu.pt gov.pt int.pt net.pt nome.pt org.pt publ.pt pw py net.py org.py gov.py edu.py com.py qa re ro com.ro org.ro tm.ro nt.ro nom.ro info.ro rec.ro arts.ro firm.ro store.ro www.ro ru com.ru net.ru org.ru pp.ru msk.ru int.ru ac.ru rw gov.rw net.rw edu.rw ac.rw com.rw co.rw int.rw mil.rw gouv.rw sa com.sa edu.sa sch.sa med.sa gov.sa net.sa org.sa pub.sa sb com.sb gov.sb net.sb edu.sb sc com.sc gov.sc net.sc org.sc edu.sc sd com.sd net.sd org.sd edu.sd med.sd tv.sd gov.sd info.sd se org.se pp.se tm.se brand.se parti.se press.se komforb.se kommunalforbund.se komvux.se lanarb.se lanbib.se naturbruksgymn.se sshn.se fhv.se fhsk.se fh.se mil.se ab.se c.se d.se e.se f.se g.se h.se i.se k.se m.se n.se o.se s.se t.se u.se w.se x.se y.se z.se ac.se bd.se sg com.sg net.sg org.sg gov.sg edu.sg per.sg idn.sg sh si sj sk sl sm sn so sr st su sv edu.sv com.sv gob.sv org.sv red.sv sy gov.sy com.sy net.sy sz tc td tel tf tg th ac.th co.th in.th go.th mi.th or.th net.th tj ac.tj biz.tj com.tj co.tj edu.tj int.tj name.tj net.tj org.tj web.tj gov.tj go.tj mil.tj tk tl tm tn com.tn intl.tn gov.tn org.tn ind.tn nat.tn tourism.tn info.tn ens.tn fin.tn net.tn to gov.to tp gov.tp tr com.tr info.tr biz.tr net.tr org.tr web.tr gen.tr av.tr dr.tr bbs.tr name.tr tel.tr gov.tr bel.tr pol.tr mil.tr k12.tr edu.tr travel tt co.tt 
com.tt org.tt net.tt biz.tt info.tt pro.tt name.tt edu.tt gov.tt tv gov.tv tw edu.tw gov.tw mil.tw com.tw net.tw org.tw idv.tw game.tw ebiz.tw club.tw tz co.tz ac.tz go.tz or.tz ne.tz ua com.ua gov.ua net.ua edu.ua org.ua cherkassy.ua ck.ua chernigov.ua cn.ua chernovtsy.ua cv.ua crimea.ua dnepropetrovsk.ua dp.ua donetsk.ua dn.ua ivano-frankivsk.ua if.ua kharkov.ua kh.ua kherson.ua ks.ua khmelnitskiy.ua km.ua kiev.ua kv.ua kirovograd.ua kr.ua lugansk.ua lg.ua lutsk.ua lviv.ua nikolaev.ua mk.ua odessa.ua od.ua poltava.ua pl.ua rovno.ua rv.ua sebastopol.ua sumy.ua ternopil.ua te.ua uzhgorod.ua vinnica.ua vn.ua zaporizhzhe.ua zp.ua zhitomir.ua zt.ua ug co.ug ac.ug sc.ug go.ug ne.ug or.ug uk ac.uk co.uk gov.uk ltd.uk me.uk mil.uk mod.uk net.uk nic.uk nhs.uk org.uk plc.uk police.uk sch.uk bl.uk british-library.uk icnet.uk jet.uk nel.uk nls.uk national-library-scotland.uk parliament.uk um us ak.us al.us ar.us az.us ca.us co.us ct.us dc.us de.us dni.us fed.us fl.us ga.us hi.us ia.us id.us il.us in.us isa.us kids.us ks.us ky.us la.us ma.us md.us me.us mi.us mn.us mo.us ms.us mt.us nc.us nd.us ne.us nh.us nj.us nm.us nsn.us nv.us ny.us oh.us ok.us or.us pa.us ri.us sc.us sd.us tn.us tx.us ut.us vt.us va.us wa.us wi.us wv.us wy.us uy edu.uy gub.uy org.uy com.uy net.uy mil.uy uz va vatican.va vc ve com.ve net.ve org.ve info.ve co.ve web.ve vg vi com.vi org.vi edu.vi gov.vi vn com.vn net.vn org.vn edu.vn gov.vn int.vn ac.vn biz.vn info.vn name.vn pro.vn health.vn vu wf ws ye com.ye net.ye yt yu ac.yu co.yu org.yu edu.yu za ac.za city.za co.za edu.za gov.za law.za mil.za nom.za org.za school.za alt.za net.za ngo.za tm.za web.za zm co.zm org.zm gov.zm sch.zm ac.zm zw co.zw org.zw gov.zw ac.zw
  • 14. Re: How can I identify a base domain name from a given URL String
    791266 Explorer
    Your point is what?