Archive for the ‘Event Type’ Category

New course – Web mining, spring 2009, period 3, 7.5 hp

Thursday, November 11th, 2010

The Internet contains a huge amount of information that is growing at an ever-increasing pace. People, organizations and corporations around the world are continuously adding various types of information to the web, in many languages. The web therefore contains potentially very interesting and valuable information. This course will investigate various techniques for processing the web in order to extract such information, refine it and make it more structured, thus making it both more valuable and more accessible. These techniques are often referred to as web mining techniques.

The domains within the Internet that we will study are databases, e-commerce web sites, wikis, virtual communities and blogs. The Semantic Web and Web 2.0 are two other concepts relevant to the course. Web mining is usually considered to comprise three main areas: web content mining, web structure mining and web usage mining. Web structure mining is closely related to information search techniques, and web usage mining to opinion mining or sentiment analysis. Also related is the automatic construction of sociograms. Web content mining can, for example, be used to find the cheapest airline tickets by monitoring the web-based databases of all airlines and trying to find what these databases have in common.
Web mining techniques explored in the course include human language technology, machine learning, statistics, information retrieval and extraction, text mining, text summarization, automatic classification, clustering, wrapper induction, normalization of data, matching the cardinality of data in different databases, interface matching, schema matching, sentiment analysis, opinion mining, extraction of comparatives, forensic linguistics, etc.
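
To make the web content mining example above concrete, here is a minimal Python sketch that collects fares from two sources and maps their different field names into one format before picking the cheapest offer. The URLs and JSON field names are invented for illustration; real airline sites would require scraping, wrapper induction and schema matching rather than clean JSON endpoints.

    # Minimal sketch of web content mining for price comparison.
    # The URLs and JSON field names are hypothetical; real airline sites would
    # require wrapper induction and schema matching instead of clean JSON APIs.
    import requests

    SOURCES = {
        "airline_a": "https://example.com/airline-a/fares?from=ARN&to=LHR",
        "airline_b": "https://example.com/airline-b/prices?origin=ARN&dest=LHR",
    }

    # Each site exposes the price (in SEK) under a different field name;
    # this tiny mapping plays the role of schema matching.
    FIELD_MAP = {"airline_a": "fare_sek", "airline_b": "price_sek"}

    def fetch_offers():
        offers = []
        for name, url in SOURCES.items():
            data = requests.get(url, timeout=10).json()
            for item in data.get("offers", []):
                offers.append({"source": name,
                               "price_sek": float(item[FIELD_MAP[name]])})
        return offers

    if __name__ == "__main__":
        offers = fetch_offers()
        if offers:
            print("Cheapest offer:", min(offers, key=lambda o: o["price_sek"]))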

Read more here: Web Mining (Web-mining), WEBMIN/IV2038, 7.5 hp.

IMAIL-Intelligent e-services for eGovernment accepted by Vinnova

Thursday, November 11th, 2010

The project vision is to design and develop eGovernment services based on human language technology tools that facilitate efficient communication between government agencies and citizens and companies, which will lead to a transformed and improved government.
The overall goal of the demonstrator is to show how further development of today's tools and technologies can improve the communication between large organizations and people. The demonstrator will run at Försäkringskassan (the Swedish Social Insurance Agency) and help to automate the communication between the agency and the public by processing text-based inquiries, primarily e-mail queries.
Our tools and technologies will:
1. automate the answering of a large part of the incoming e-mail flow (see the sketch after this list),
2. improve right-on-time answers to inquiries submitted through electronic devices,
3. change the workload for the administrators at Försäkringskassan and make better use of their skills.
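
As a rough illustration of such automated answering (not IMAIL's actual design), the sketch below routes an incoming inquiry to a topic by keyword matching and returns a canned template answer, leaving the e-mail to a human administrator when no topic matches. The topics, keywords and templates are invented.

    # Sketch of automated e-mail answering: route an inquiry to a topic by
    # keyword matching and return a template answer; None means "give it to
    # a human administrator". Topics, keywords and templates are invented.
    import re

    TEMPLATES = {
        "parental_benefit": "Information about parental benefit can be found at ...",
        "housing_allowance": "To apply for housing allowance, please ...",
    }

    KEYWORDS = {
        "parental_benefit": {"föräldrapenning", "parental", "benefit"},
        "housing_allowance": {"bostadsbidrag", "housing", "allowance"},
    }

    def answer_email(text):
        """Return a template answer, or None to route the e-mail to a human."""
        words = set(re.findall(r"\w+", text.lower()))
        best_topic, best_hits = None, 0
        for topic, keys in KEYWORDS.items():
            hits = len(words & keys)
            if hits > best_hits:
                best_topic, best_hits = topic, hits
        return TEMPLATES[best_topic] if best_topic else None

    if __name__ == "__main__":
        print(answer_email("Hur ansöker jag om bostadsbidrag?"))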

The project will be carried out in cooperation with Försäkringskassan, Statistics Sweden (SCB) and the Human Language Technology Group at CSC/KTH.

Press release from Vinnova (in Swedish): Användarna i fokus i VINNOVA-satsning på e-förvaltning.

New project proposal to Vinnova: Beslutsstöd genom utforskning av patientjournaler (Decision support through exploration of patient records)

Thursday, November 11th, 2010

Summary:
In this project we intend to build a demonstrator from the existing tools we have developed within the Vinnova-funded KEA project. With the tool, users will be able to explore fully deidentified patient records in order to find both visible and hidden relations between diseases, diagnoses, diet, social situation, medication, etc. The input data used for this consists of the hundreds of thousands of patient records that the research group at DSV/Stockholm University has been given access to by Stockholm County Council.
After the project period, the demonstrator will be commercialized so that it benefits health care and society.

GSLT-Retreat, Gullmarsstrand, on the west coast, January 26-28, 2009

Thursday, November 11th, 2010

Sumithra Velupillai and Hercules Dalianis participated in the yearly GSLT (Graduate School of Language Technology) Retreat. There were over 50 participants at the conference: PhD students, supervisors and alumni from all over Sweden; see here for the conference programme.

DSV/KTH-Stockholm University is part of GSLT. At DSV we have one more GSLT PhD student, Atelach Argaw, who is supervised by associate professor Lars Asker. Dr. Martin Hassel at DSV is also one of the GSLT supervisors and is Sumithra's supervisor.

Sumithra presented two posters, one with the title Mixing and Blending Syntactic and Semantic Dependencies. The research was carried out during a GSLT PhD course in machine learning together with the PhD students Yvonne Samuelsson, Oscar Täckström, Johan Eklund, Mark Fishel and Markus Saers. This poster was also presented at the CoNLL workshop at Coling in August 2008.

Hercules presented our patient data corpus, the Stockholm EPR corpus, and some experiments we have carried out on it.

Some photos from the GSLT-Retreat:


Master thesis: Using parallel corpora and Uplug to create a Chinese-English dictionary. Defended December 10, 2008

Thursday, November 11th, 2010

Authors: Hao-chun Xing (EMIS) & Xin Zhang (EMIS)

Abstract

This master thesis is about using parallel corpora and word alignment to
automatically create a bilingual Chinese-English dictionary. The dictionary
can contribute to multilingual information retrieval or serve as a
domain-specific dictionary.

However, the creation of bilingual dictionaries is a difficult task,
especially in the case of Chinese. Chinese is very different from European
languages: it has no delimiters marking word boundaries as European
languages do. Therefore we needed Chinese word segmentation software to
insert boundaries between the Chinese words so that they can correspond to
English words. This was one of the difficult issues in our project, and we
spent half of our time on it.

Our parallel corpus consists of 104,563 Chinese characters, that is
approximately 50,000-60,000 Chinese words, and 75,997 English words, mainly
law texts. We used ICTCLAS as the Chinese word segmentation software to
pre-process the raw Chinese text, and then we used the word alignment system
Uplug to process the prepared parallel corpus and to create the Chinese-
English dictionary.
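
The thesis relied on ICTCLAS and Uplug, but the underlying idea of harvesting dictionary entries from a sentence-aligned, word-segmented parallel corpus can be illustrated with a toy co-occurrence measure (the Dice coefficient). The miniature corpus below is made up, and this is not Uplug's actual alignment algorithm.

    # Toy illustration: propose Chinese-English dictionary entries from a
    # sentence-aligned, word-segmented parallel corpus by co-occurrence
    # counting (Dice coefficient). Not Uplug's algorithm; illustration only.
    from collections import Counter
    from itertools import product

    # Miniature hand-made "corpus": (segmented Chinese, English) sentence pairs.
    corpus = [
        ("法律 保护 公民", "the law protects citizens"),
        ("公民 遵守 法律", "citizens obey the law"),
    ]

    zh_count, en_count, pair_count = Counter(), Counter(), Counter()
    for zh_sent, en_sent in corpus:
        zh_words, en_words = set(zh_sent.split()), set(en_sent.split())
        zh_count.update(zh_words)
        en_count.update(en_words)
        pair_count.update(product(zh_words, en_words))

    def dice(zh, en):
        return 2 * pair_count[(zh, en)] / (zh_count[zh] + en_count[en])

    # For each Chinese word, propose the English word with the highest score.
    for zh in zh_count:
        en_best = max(en_count, key=lambda en: dice(zh, en))
        print(zh, "->", en_best, round(dice(zh, en_best), 2))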

Our dictionary contains 2,118 entries. We evaluated the results with the
assistance of nine native Chinese speakers. The average accuracy of the
dictionary is 74.1 percent.

Key Words: Parallel Corpora, Chinese Word Segmentation, Uplug, Word
Alignment Tool

Download master thesis

Seminar: Work on Electronic Health Records – from texts to improved health care, Dec 2, 9-12

Thursday, November 11th, 2010

9 – 10: Louhi project: Text mining of Finnish nursing narratives
[Slides text mining] [Slides parsing]
For more information about the project, click here.


10 – 11: DSV-KEA project: The Stockholm-EPR Corpus and some experiments.

[slides]

For more information about the project, click here.

11 – 12: LIME, Karolinska institutet, Sabine Koch: Integrating electronic health records to bridge health and social care [slides]
For more information about the project, click here.

Tuesday, December 2, 2008, meeting room 6405, floor 6, Forum, DSV, Kista

(EPR = Electronic Patient Records)

Some notes on the presentations:
The Louhi project at the University of Turku carries out research on intensive care unit patient records written in Finnish. According to the participants in the Louhi project, 20-30 percent of the clinicians' time goes to documentation during the health care process, and often the clinicians also need to enter the same information in several different medical record systems. There can be up to 50 pages of documentation for a single ICU (intensive care unit) patient. It is very difficult to get an overview of this information, and therefore very little of the written information is reused. Almost none of the information in the EPRs is transferred to the discharge letter.

The DSV-KEA group presented the Stockholm-EPR corpus, which contains several hundred thousand patient records. From the Stockholm-EPR corpus a Gold corpus was created that will assist in assessing the quality of a deidentification system for Swedish patient records. The Gold corpus contains 100 patient records and contains on average 4,200 PHIs (Protected Health Information), which is 2.5 percent of the total amount of information. Names of both clinicians and patients account for about 0.75 percent of the total amount of information.

Sabine Koch presented a system from the Vinnova-supported project old@home. In the project, handheld devices were developed to support the health care process. In this system all people involved in the health care process obtained customized information about the patient. The users are the patients themselves, the nurses, the home helpers, the relatives, etc. The system is now partly in use in Hudiksvalls kommun (municipality).

New edition of Informationssökning på Internet by Våge, Dalianis and Iselid, published by Studentlitteratur

Thursday, November 11th, 2010

More information about the book

New Project Proposal to Vinnova – IMAIL-Intelligent e-mail answering service for eGovernment

Thursday, November 11th, 2010

We (Martin Hassel, Eriks Sneiders, Tessy Ceratto, Ola Knutsson (CSC), Viggo Kann (CSC) and Magnus Rosell (CSC)) are preparing an application to Vinnova, deadline September 2, 2008. Title: IMAIL – Intelligent mail answering service for eGovernment. Other partners: Försäkringskassan (Swedish Social Insurance Agency) and Euroling AB.

Abstract
The project vision is to design and develop eGovernment services that facilitate efficient communication between government agencies and citizens and companies, which will lead to a transformed and improved government.
The overall goal of the demonstrator is to show how further development of today's tools and technologies can improve the communication between large organizations and people. The demonstrator will run at Försäkringskassan and help to automate the communication between the agency and the public by processing text-based inquiries, primarily e-mail queries.
Our tools and technologies will:
1. automate the answering of a large part of the incoming e-mail flow,
2. improve right-on-time answers to inquiries submitted through electronic devices.

Two-year project with a budget of 4.6 million SEK.

Talk: “A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike” by Bart Jongejan

Thursday, November 11th, 2010

A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike
Talk by Bart Jongejan, CST, University of Copenhagen, Tuesday, May 6, 2008, 13.00-14.45, meeting room 7501, Forum, DSV, Kista.

In some Indo-European languages, such as English and the North Germanic languages, most words can be lemmatized by removing or replacing a suffix. In languages like German and Dutch, on the other hand, lemmatization often proceeds by removing, adding or replacing other types of affixes, and even by combinations of such string operations.

The rules for the new lemmatizer are created by automatic training on a large sample set of full form – lemma pairs. An attempt was made to allow a rule-based attribution of a word to more than one lemma (when appropriate), but this had to be given up. The current implementation produces one lemma per word when the lemmatization rules are applied and relies on an optional built-in dictionary to produce additional correct lemmas of known words only.
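
A toy sketch of the kind of affix rewrite rules described above: each rule may rewrite a prefix, an internal substring or a suffix, and exactly one lemma is produced per word. The hand-written rules and example words are purely illustrative; the lemmatizer in the talk induces its rules automatically from full form - lemma training pairs.

    # Toy affix-rewrite lemmatizer: each rule rewrites a prefix, infix or
    # suffix; the first matching rule wins, so one lemma is produced per word.
    # The rules are hand-written examples; the real lemmatizer learns them
    # automatically from full form - lemma training pairs.
    import re

    RULES = [
        (re.compile(r"^ge(.+)t$"), r"\1en"),   # Dutch: gewerkt -> werken
        (re.compile(r"^(.+)ies$"), r"\1y"),    # English: stories -> story
        (re.compile(r"^(.+)s$"), r"\1"),       # English: cats -> cat
    ]

    def lemmatize(word):
        """Apply the first matching rule; fall back to the word itself."""
        for pattern, replacement in RULES:
            if pattern.match(word):
                return pattern.sub(replacement, word)
        return word

    if __name__ == "__main__":
        for w in ["gewerkt", "stories", "cats", "house"]:
            print(w, "->", lemmatize(w))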

The first results of tests show that the new lemmatizer probably has a higher accuracy than the former CSTlemma software, even for languages with mainly suffix morphology, but that the errors it makes may sometimes be “more wrong” than the errors made by the old CSTlemma software.

AVID-Deidentifying Swedish Medical Records for Better Health Care submitted April 15, 2008 to Vetenskapsrådet

Thursday, November 11th, 2010

Within hospital care there has been an explosion in the production of medical record data. A large amount of this data is unstructured free text that is almost never reused. Our research group will soon have access to more than one million medical records from Stockholm County Council. Currently, we already have access to 5,000 medical records within rheumatology. Unfortunately the free text of the medical records very often contains misspellings and syntactic errors as well as plenty of unknown abbreviations, and is therefore difficult for computers to process.

In order to use the free-text corpus for research purposes it is also necessary to deidentify the texts, since they typically contain information that can identify the individual patient. In this project we will therefore normalise and deidentify the medical records, and we expect to reach 99 percent deidentification. When this is carried out, we and the research community will be able to use human language technology tools such as text mining and text extraction methods to find previously uncharted relations between diseases, medical treatment, age, occupation, social situation, etc.

One primary goal of this project is thus to make it possible for researchers in medicine to use the abundant digital textual information that is available in medical records. Such research has never previously been carried out in Sweden, and it is unique due to the kind and the large amount of textual data being used.
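
A minimal sketch of the rule-based side of such deidentification: regular expressions for Swedish personal identity numbers and phone numbers plus a small name list, each match replaced by a placeholder tag. The patterns, tags and names are illustrative assumptions; the project combines rules like these with statistical named entity recognition and normalization.

    # Minimal sketch of rule-based deidentification of Swedish clinical text.
    # Patterns, placeholder tags and the name list are illustrative only; a
    # real system combines such rules with statistical NER.
    import re

    PATTERNS = [
        (re.compile(r"\b\d{6,8}[-+]?\d{4}\b"), "<PERSONNUMMER>"),              # personal identity number
        (re.compile(r"\b0\d{1,3}[- ]?\d{2,3} ?\d{2} ?\d{2,3}\b"), "<PHONE>"),  # phone number
    ]

    FIRST_NAMES = {"Bengt-Åke", "Anna", "Erik"}  # tiny illustrative gazetteer

    def deidentify(text):
        for pattern, tag in PATTERNS:
            text = pattern.sub(tag, text)
        for name in FIRST_NAMES:
            text = text.replace(name, "<FIRST_NAME>")
        return text

    if __name__ == "__main__":
        sample = "Patientens man Bengt-Åke nås på telefonnummer 08-123 4567."
        print(deidentify(sample))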

Popular scientific description:

Deidentification of patient records for better health care

Within health care, a very large number of digital patient records are produced by physicians and nurses. The records contain information about the patient's general condition, symptoms, diagnosis and treatment. Together these patient records contain valuable information, in particular the free-text parts, which are not used at all in medical research. We have previously carried out experiments on 5,000 deidentified patient records within rheumatology and found two problems:

One problem is that the records, even though they have been deidentified so that they can be used in research, still contain information that may make it possible to identify the patients, for example references to the patient's occupation (a CEO position at Alfa Laval) or to family members and telephone numbers (the patient's husband Bengt-Åke can be reached at telephone number 08-123 4567). The other problem is that the record texts contain many misspellings and grammatical errors, as well as ambiguous abbreviations, which make them difficult to process with computer programs.
In this research project we therefore intend both to correct the misspellings in these patient records and give concepts a uniform spelling, and to deidentify the text. Both spelling correction and deidentification of the texts will be carried out with fully automatic language technology methods. We will start from just over one million patient records that we will soon get access to through Stockholm County Council.

These patient record texts are the material on which we will train our systems, so that they learn to recognize new concepts. The automatic methods for named entity recognition, and thereby deidentification, can be created using either rule-based or statistical methods. With these methods one can then automatically recognize person names, occupations, places, organizations, etc. When this has been done we will have a large number of patient records with perhaps up to 99 percent completely deidentified content, which enables research on a unique material. We hope to be able to make our cleaned patient record corpus and the language technology tools developed in the project available to the Swedish National Data Service (SND) for further dissemination.

The automatic spelling correction system is based on rules for how misspelled words in a text can be corrected. The spelling correction system uses both lexicons and abbreviation lists and will correct the misspelled words in the patient records, but we will also use special medical word lists such as the FASS lists of drug names. The patient record texts, with over one million patient records, can also be used to derive new domain-specific word lists, letting the most common spellings of words "win over" the less common spellings.
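
As a toy sketch of this "winning over" idea (assuming nothing about the project's actual implementation), a rare spelling can be replaced by a much more frequent corpus word that lies within one edit operation:

    # Toy sketch of frequency-based spelling normalization: replace a rare
    # spelling with a much more frequent corpus word within one edit operation,
    # so the most common spelling "wins over" the less common ones.
    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

    def edits1(word):
        """All strings one delete, replace or insert away from word."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
        inserts = {a + c + b for a, b in splits for c in ALPHABET}
        return deletes | replaces | inserts

    def normalize(word, freq, min_ratio=5):
        """Return a much more frequent near-identical spelling, if one exists."""
        candidates = [w for w in edits1(word) if freq[w] > min_ratio * freq[word]]
        return max(candidates, key=lambda w: freq[w]) if candidates else word

    # Invented corpus frequencies, for illustration only.
    corpus_freq = Counter({"penicillin": 120, "pencillin": 3, "diabetes": 200})
    print(normalize("pencillin", corpus_freq))   # -> penicillin
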
The research that can be carried out on these patient records includes both traditional search within the collected record text of a single individual and search across several individuals. Most important of all, there will be a large body of material gathering valuable information about a large number of patients, which can be used to extract new information and knowledge.

The project has two goals: to create a large deidentified patient record corpus in Swedish for research purposes, and to give the research community access to the language technology tools for deidentification and for working with similar text collections that are developed in the project. This will make it easy in the future to create new deidentified text collections and to work with large, information-dense text collections.

Our project is unique in that it is the first time anyone will carry out deidentification and cleaning of just over one million patient record texts (in Swedish). Previous work has usually concerned at most a few thousand patient records in English. This research is highly relevant because it will enable health care to make use of all the accumulated knowledge written in free text, together with more "hard" measurement values, and thereby find new knowledge for better health care.