Archive for the ‘Uncategorized’ Category

Cross-language search article accepted to NODALIDA 2009, May 15-16, Odense, Denmark

Thursday, November 11th, 2010

The article titled Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian has been accepted for presentation at the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009), May 15-16, Odense, Denmark. The authors are Hercules Dalianis, Martin Rimka and Viggo Kann from CSC/KTH.

The paper will be posted when camera-ready.

Nordforsk Network application meeting in Åre, March 15-17, 2009

Thursday, November 11th, 2010

A group of researchers and PhD students met in Åre, Sweden, to write an application for a research network. We aim to obtain funding for the HEalth teXt Analysis network in the Nordic and Baltic countries (HEXAnord), for constructing synthetic patient records for the development and evaluation of text mining tools for medical health records. We will also apply for research training courses. The funder is NordForsk, under the Nordic Council. The applicants came from the Human Language Technology Group (Martin Hassel, Sumithra Velupillai and Hercules Dalianis, SYSLAB, DSV), from the Norwegian University of Science and Technology, NTNU (Øystein Nytrø, Laura Slaughter and Karen Marie Lyng), and from the Technical University of Denmark, DTU (Francisco Roque). Other involved partners are the University of Turku, Finland, the University of Tartu, Estonia, and Vytautas Magnus University, Lithuania.

Click here for the Åre photo album.

GSLT Retreat, Gullmarsstrand, on the west coast, January 26-28, 2009

Thursday, November 11th, 2010

Sumithra Velupillai and Hercules Dalianis participated in the yearly retreat of GSLT, the Graduate School of Language Technology. There were over 50 participants at the conference: PhD students, supervisors and alumni from all over Sweden. See here for the conference programme.

DSV/KTH-Stockholm University is part of GSLT. At DSV we have one more GSLT PhD student, Atelach Argaw, who is supervised by associate professor Lars Asker. Dr. Martin Hassel at DSV is one of the GSLT supervisors and also supervises Sumithra.

Sumithra presented two posters, one with the title Mixing and Blending Syntactic and Semantic Dependencies. The research was carried out during a GSLT PhD course in machine learning together with the PhD students Yvonne Samuelsson, Oscar Täckström, Johan Eklund, Mark Fishel and Markus Saers. This poster was also presented at the Coling workshop CoNLL in August 2008.

Hercules presented our patient data corpus, the Stockholm EPR corpus, and some experiments we have carried out on it.

[Photos from the GSLT Retreat]

Master's thesis: Using parallel corpora and Uplug to create a Chinese-English dictionary. Defended December 10, 2008

Thursday, November 11th, 2010

Authors: Hao-chun Xing (EMIS) & Xin Zhang (EMIS)

Abstract

This master's thesis is about using parallel corpora and word alignment to automatically create a bilingual Chinese-English dictionary. Such a dictionary can contribute to multilingual information retrieval or serve as the basis for a domain-specific dictionary.

However, the creation of bilingual dictionaries is a difficult task, especially for Chinese. Unlike European languages, Chinese has no delimiters marking word boundaries. We therefore needed Chinese word segmentation software to insert boundaries between the Chinese words so that they could be put in correspondence with English words. This was one of the difficult issues in our project, and we spent half of our time on it.

Our parallel corpus consists of 104,563 Chinese characters, approximately 50,000-60,000 Chinese words, and 75,997 English words, mainly law texts. We used ICTCLAS as the Chinese word segmentation software to pre-process the raw Chinese text, and then we used the word alignment system Uplug to process the prepared parallel corpus and create the Chinese-English dictionary.
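
To make the pipeline concrete, below is a minimal sketch of the segmentation step, using the open-source jieba segmenter as a stand-in for ICTCLAS. The choice of jieba, the file names and the corpus layout are assumptions made for illustration; they are not the tools or data of the thesis.

    # Sketch of the pre-processing step: insert spaces between Chinese words
    # so the text matches the whitespace-tokenized format that word aligners
    # such as Uplug expect. NOTE: the thesis used ICTCLAS; jieba is an
    # open-source stand-in used here for illustration. File names are invented.
    import jieba

    def segment_line(line: str) -> str:
        """Return one line of raw Chinese text with words space-delimited."""
        return " ".join(tok for tok in jieba.cut(line.strip()) if tok.strip())

    with open("corpus.zh.raw", encoding="utf-8") as src, \
         open("corpus.zh.seg", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(segment_line(line) + "\n")

The segmented Chinese side, together with the sentence-aligned English side, can then be passed to the word aligner to extract candidate translation pairs for the dictionary.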

Our dictionary contains 2,118 entries. We evaluated the results with the assistance of nine native Chinese speakers. The average accuracy of the dictionary is 74.1 percent.

Key Words: Parallel Corpora, Chinese Word Segmentation, Uplug, Word
Alignment Tool

Download the master's thesis

Seminar: Work on Electronic Health Records – from texts to improved health care, Dec 2, 9-12

Thursday, November 11th, 2010

9-10: Louhi project: Text mining of Finnish nursing narratives [Slides text mining] [Slides parsing]
For more information about the project, click here.

10-11: DSV-KEA project: The Stockholm-EPR Corpus and some experiments [slides]
For more information about the project, click here.

11-12: LIME, Karolinska institutet, Sabine Koch: Integrating electronic health records to bridge health and social care [slides]
For more information about the project, click here.

Tuesday, December 2, 2008, meeting room 6405, floor 6, Forum, DSV, Kista

(EPR = Electronic Patient Records)

Some notes on the presentations:
The Louhi project at the University of Turku carries out research on intensive care unit (ICU) patient records written in Finnish. According to the participants of the Louhi project, 20-30 percent of the clinicians' time goes to documentation during the health care process, and the clinicians often need to enter the same information in several different medical record systems. There can be up to 50 pages of documentation for a single ICU patient. This information is very difficult to obtain an overview of, and therefore very little of it is reused. Almost none of the information in the EPRs is transferred to the discharge letter.

The DSV-KEA group presented the Stockholm-EPR corpus, which contains several hundred thousand patient records. From the Stockholm-EPR corpus a Gold corpus was created that will assist in assessing the quality of a de-identification system for Swedish patient records. The Gold corpus contains 100 patient records with around 4,200 PHI (Protected Health Information) instances, corresponding to 2.5 percent of the total amount of text. Names of clinicians and patients account for about 0.75 percent of the total amount of text.
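
A gold corpus of this kind is typically used to score a de-identifier by comparing its predicted PHI spans with the manually annotated ones. A minimal sketch of such a comparison is given below; the span representation and the toy data are assumptions made for illustration, not the group's actual evaluation code.

    # Sketch: exact-match precision and recall of a de-identifier against
    # gold PHI annotations. Spans are (record_id, start, end, phi_class)
    # tuples; this representation and the toy data are illustrative only.

    def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
        tp = len(gold & predicted)  # spans with exactly the right extent and class
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall

    gold = {(1, 0, 10, "PATIENT_NAME"), (1, 42, 52, "DATE")}
    pred = {(1, 0, 10, "PATIENT_NAME"), (1, 60, 65, "LOCATION")}
    p, r = precision_recall(gold, pred)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50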

Sabine Koch presented a system from the Vinnova-supported project old@home. In the project, handheld devices were developed to support the health care process. In this system, everyone involved in the care process obtains customized information about the patient: the patients themselves, the nurses, the home helpers, the relatives, etc. The system is now partly in use in Hudiksvalls kommun (municipality).

New course – Web mining, spring 2009, period 3, 7.5 hp

Thursday, November 11th, 2010

The Internet contains a huge amount of information and is growing at an ever-increasing pace. People, organizations and corporations from all over the world are continuously adding different types of information to the web, in various languages. The web therefore contains potentially very interesting and valuable information. This course will investigate various techniques for processing the web in order to extract such information, refine it and make it more structured, thus making it both more valuable and more accessible. These techniques are often referred to as web mining techniques.

The domains within the Internet that we will study are databases, e-commerce web sites, wikis, virtual communities and blogs. The Semantic Web and Web 2.0 are two other concepts that are relevant for the course. Web mining is considered to comprise three main areas, namely web content mining, web structure mining and web usage mining. Web structure mining is closely related to information search techniques, and web usage mining to opinion mining and sentiment analysis. Also related is the automatic construction of sociograms. Web content mining can for example be used to find the cheapest airline tickets, by monitoring the web-based databases of all airlines and reducing their offers to a common, comparable form.

Web mining techniques explored in the course include human language technology, machine learning, statistics, information retrieval and extraction, text mining, text summarization, automatic classification, clustering, wrapper induction, normalization of data, matching the cardinality of data across databases, interface matching, schema matching, sentiment analysis, opinion mining, extraction of comparatives, forensic linguistics, etc.
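
As a toy illustration of the web content mining idea above, the sketch below fetches a page and pulls out prices so that offers from several sites can be compared. The CSS selector and the price format are hypothetical; in practice every site needs its own wrapper, which is exactly what wrapper induction tries to learn automatically.

    # Toy web content mining sketch: extract prices from a product page so
    # offers can be compared across sites. The CSS selector and the price
    # format are hypothetical; each real site would need its own wrapper.
    import requests
    from bs4 import BeautifulSoup

    def extract_prices(url: str, selector: str) -> list[float]:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        prices = []
        for node in soup.select(selector):
            text = node.get_text(strip=True).replace("SEK", "").replace(",", "")
            try:
                prices.append(float(text))
            except ValueError:
                pass  # ignore nodes that do not contain a clean number
        return prices

    # Hypothetical usage: cheapest offer per airline site.
    # cheapest = {site: min(extract_prices(url, ".price")) for site, url in sites.items()}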

Read more here: Web Mining/Web-mining, WEBMIN/IV2038, 7.5 hp.

IMAIL-Intelligent e-services for eGovernment accepted by Vinnova

Thursday, November 11th, 2010

The project vision is to design and develop eGovernment services, based on human language technology tools, that facilitate efficient communication between government agencies on the one side and citizens and companies on the other, which will lead to a transformed and improved government.
The overall goal of the demonstrator is to show how further development of today's tools and technologies can improve the communication between large organizations and people. The demonstrator will run at Försäkringskassan (the Swedish Social Insurance Agency) and help to automate the communication between the agency and the public by processing text-based inquiries, primarily e-mail queries.
Our tools and technologies will:
1. automate the answering of a large part of the incoming e-mail flow,
2. improve right-on-time answers to inquiries submitted through electronic devices, and
3. change the workload of the administrators at Försäkringskassan and use their skills in a better way.
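
One standard technique for this kind of automated e-mail answering is to match an incoming message against a bank of template answers by text similarity. The sketch below shows the idea with TF-IDF and cosine similarity; the FAQ bank and the threshold are invented for illustration, and this is not a description of the actual IMAIL method.

    # Sketch: route an incoming e-mail to the closest canned answer via
    # TF-IDF cosine similarity. The FAQ bank and threshold are invented;
    # this illustrates the general technique, not the IMAIL system itself.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    faq = [
        ("How do I apply for parental benefit?", "ANSWER_PARENTAL_BENEFIT"),
        ("When is housing allowance paid out?", "ANSWER_HOUSING_ALLOWANCE"),
    ]
    vectorizer = TfidfVectorizer()
    faq_matrix = vectorizer.fit_transform(question for question, _ in faq)

    def answer(email_text: str, threshold: float = 0.3) -> str:
        scores = cosine_similarity(vectorizer.transform([email_text]), faq_matrix)[0]
        best = scores.argmax()
        # Below the threshold the mail is handed over to a human administrator.
        return faq[best][1] if scores[best] >= threshold else "ROUTE_TO_HUMAN"

    print(answer("when will my housing allowance be paid?"))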

The project will be carried out in cooperation with Försäkringskassan, Statistics Sweden (SCB), and the Human Language Technology Group at CSC/KTH.

Press release from Vinnova (in Swedish): Användarna i fokus i VINNOVA-satsning på e-förvaltning ("Users in focus in VINNOVA initiative on e-government").

New project proposal to Vinnova: Beslutsstöd genom utforskning av patientjournaler (Decision support through exploration of patient records)

Thursday, November 11th, 2010

Summary:
In this project we intend to construct a demonstrator from the existing tools that we have developed within the Vinnova-funded KEA project. With the tool, users will be able to explore fully de-identified patient records in order to find both visible and hidden relationships between diseases, diagnoses, diet, social situation, medication, etc. The input data used to accomplish this is the hundreds of thousands of patient records that the research group at DSV/Stockholm University has been given access to by Stockholm County Council (Stockholms läns landsting).
After the end of the project period, the demonstrator will be commercialized so that it can benefit health care and society.

New Project Proposal to Vinnova – IMAIL-Intelligent e-mail answering service for eGovernment

Thursday, November 11th, 2010

We, that is Martin Hassel, Eriks Sneiders, Tessy Ceratto, Ola Knutsson (CSC), Viggo Kann (CSC) and Magnus Rosell (CSC), are preparing an application to Vinnova with deadline September 2, 2008. Title: IMAIL, Intelligent mail answering service for eGovernment. Other partners are Försäkringskassan (the Swedish Social Insurance Agency) and Euroling AB.

Abstract
The project vision is to design and develop eGovernment services that facilitate efficient communication between government agencies on the one side and citizens and companies on the other, which will lead to a transformed and improved government.
The overall goal of the demonstrator is to show how further development of today's tools and technologies can improve the communication between large organizations and people. The demonstrator will run at Försäkringskassan and help to automate the communication between the agency and the public by processing text-based inquiries, primarily e-mail queries.
Our tools and technologies will:
1. automate the answering of a large part of the incoming e-mail flow, and
2. improve right-on-time answers to inquiries submitted through electronic devices.

Two-year project, budget: 4.6 million SEK.

Talk: “A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike” by Bart Jongejan

Thursday, November 11th, 2010

A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike
Talk by Bart Jongejan, CST, University of Copenhagen, Tuesday, May 6, 2008, 13.00-14.45, meeting room 7501, Forum, DSV, Kista.

In some Indo-European languages, such as English and the North Germanic languages, most words can be lemmatized by removing or replacing a suffix. In languages like German and Dutch, on the other hand, lemmatization often proceeds just as regularly by removing, adding or replacing other types of affixes, and even by combinations of such string operations.

The rules for the new lemmatizer are created by automatic training on a large sample set of full form – lemma pairs. An attempt was made to allow a rule-based attribution of a word to more than one lemma (when appropriate), but this had to be given up. The current implementation produces one lemma per word when the lemmatization rules are applied and relies on an optional built-in dictionary to produce additional correct lemmas of known words only.
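
The abstract does not spell out the training algorithm, but the flavour of learning lemmatization rules from full form – lemma pairs can be shown with a much simplified, suffix-only scheme. The sketch below illustrates the family of techniques only; Jongejan's lemmatizer also handles prefix and infix changes.

    # Simplified sketch: learn suffix-replacement rules from (full form, lemma)
    # pairs. Jongejan's lemmatizer also handles prefixes and infixes; this
    # suffix-only version merely illustrates the idea of training on pairs.
    import os
    from collections import Counter

    def suffix_rule(form: str, lemma: str) -> tuple[str, str]:
        """Strip the longest common prefix; keep the differing suffixes."""
        i = len(os.path.commonprefix([form, lemma]))
        return form[i:], lemma[i:]

    def train(pairs):
        counts = Counter(suffix_rule(form, lemma) for form, lemma in pairs)
        rules = {}
        for (old, new), _ in counts.most_common():
            rules.setdefault(old, new)  # keep the most frequent rewrite per suffix
        return rules

    pairs = [("walked", "walk"), ("talked", "talk"), ("cities", "city")]
    print(train(pairs))  # {'ed': '', 'ies': 'y'}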

The first test results show that the new lemmatizer probably has a higher accuracy than the former CSTlemma software, even for languages with mainly suffix morphology, but that the errors it makes may sometimes be “more wrong” than the errors made by the old CSTlemma software.