Found 50 repositories (showing 30)
JiangYanting
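Terminology dictionary datasets / word segmentation dictionaries / specialized vocabulary corpora / lexical knowledge bases / domain word list downloads / thesauri / lexicons / natural language processing / data mining / deep learning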
google-research-datasets
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'. The 'assistant' elicits the 'user’s' preferences about movies following a Coached Conversational Preference Elicitation (CCPE) method. The assistant asks questions designed to minimize the bias in the terminology the 'user' employs to convey his or her preferences as much as possible, and to obtain these preferences in natural language. Each dialog is annotated with entity mentions, preferences expressed about entities, descriptions of entities provided, and other statements of entities.
mtresearcher
Terminology Dataset
Gitster7
In this project we will build a model capable of generating notes and chords, and thus creating songs, after learning from the dataset of songs we provide to our recurrent neural network. Before we start, let us recall a few of the basic concepts and terminology we will be using in this project, and add to that knowledge the concepts required to successfully train our model to be an excellent composer.
TOUATI-MOHAMED
Nowadays, chatbots have emerged in all domains and proved their efficiency in helping assistants save time and manage interactions with customers. However, communicating with a conversational agent can be frustrating, especially when the goal of the chatbot is to help users overcome their problems. That is why NLP researchers have developed a new terminology that aims to make virtual assistants sound and behave less like robots and as human-like as possible. The tool that can enhance this is sentiment analysis. In this context, Pixemantic decides to include a Positive chatbot in its new platform called Dr.Happy, which aims to help users overcome their depression, anxiety, and daily-life problems. To achieve this goal, Pixemantic needs a solid dataset that allows the chatbot to discuss different topics with users and assist them with solutions for their issues. We will cover the main milestones in preparing the dataset needed for the Positive chatbot.
zequnl
The implementation of the EMNLP 2021 paper "Graphine: A Dataset for Graph-aware Terminology Definition Generation"
dsfsi
Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+ datasets for South African & African languages. Includes speech recognition, NLP, terminology, health, legal & financial data across HuggingFace, GitHub, Zenodo & more.
Songrui9269
MapReduce is a programming model that involves two steps. The first, the map step, takes an input set I and groups it into N equivalence classes I0, I1, I2, ..., IN-1. I can be thought of as a set of tuples <key, data>, and the function map maps I into the equivalence classes based on the value of key. In the second reduce step, the equivalence classes are processed, and the set of tuples in an equivalence class Ij are reduced into a single value. MapReduce has become very popular in part because of its use by Google, but is an old parallel programming model. It is surprisingly general. To perform a parallel MapReduce, the input is spread across the available processors. Each processor runs one or more instances of map, followed by executing one or more instances of reduce. Each instance of map will potentially form equivalence classes I0, I1, I2, ..., IN-1. Consider the word counting problem, which can be solved in parallel using MapReduce. Given a list of words, the output should consist of how many times each word appeared in the list (or text). Viewing the input as tuples, the word is the key, and the data is the constant 1. A naive map function would collect all instances of a word into an equivalence class. Each equivalence class would then be assigned to a process pr, and process pr would determine the cardinality of the equivalence classes from all maps, which would be the word count. A more intelligent map function would form singleton equivalence classes Iword, where the only element is <word, count>. The process pr that reduces Iword would receive the Iword equivalence classes from all of the map functions, and would perform a reduction on the class. In Google terminology, the function that performs this optimization is called a combiner and executes on the same process as the map. 
This is important since its function is to combine many members of an equivalence class into a single member so as to decrease the volume of data communicated between the map and reduce stages. A second optimization that can be performed is to group multiple equivalence classes together to be sent to the same reducer. Thus, the records for “cat”, “dog”, “test” and “homework” might be sent by different mappers to the same reducer. This enables all of them to be sent by a single communication operation, improving the efficiency of the communication. The question then becomes: how do we decide which equivalence classes to group together? This decision is made using a hash function H. Let’s say we will have R reducers. Then a function with 0 ≤ H(key) ≤ R-1 will group the equivalence classes into R groups to be sent to the R reducers. What we will program: We will program a map reduce that executes on a distributed memory machine and uses OpenMP on each core to compute the map reduce. The project will be done in three steps: the OpenMP version and a wordcount map reduce (20% of the project grade); the MPI version that uses the OpenMP version to perform node-local computation with a wordcount map reduce (20% of the project grade); and the final turn-in (60% of the project grade). Details are given below. Note that even though I use OpenMP you can use Pthreads, Java or other code that supports multithreading to write the shared memory version. Note that if you use Java you will need to use Java isolates to communicate between nodes/processes. General information: The text for the map reduce will be distributed across FI input text files, where FI > Nmpi*C, Nmpi is the number of nodes (machines and processes) used by MPI, and C is the number of cores on each processor. OpenMP code (i.e. OpenMP code on a node).
There will be four kinds of threads: Reader threads, which read files and put the data read (or created by self-initialization) into a work queue. For wordcount each work item will be a word. For the numerical problem, each entry can be a section of the array that a thread should work on; Mapper threads, which execute in parallel with Reader threads (at least until the Reader threads finish) and create combined records of words. I.e., if there are 2045 instances of “cat” in the files read by the program, the final output of the mapper threads will be a record that looks like <“cat”,2045>; Reducer threads that operate on work queue entries created by mapper threads and combine (reduce) them to a single record. Thus, for the word “cat”, there is potentially a <“cat”,counti> record sent by every mapper thread ti in the system and it will sum all of the counts and place it on a work queue. For each word there is exactly one Reducer thread in the system that handles it. Writer threads that take a sum from the work queue and write it to a file. Note that each process can write its results to a separate file. You may not need threads for each of these but only different work queue entries. Thus, Reader and Writer threads run at different times. Mapper and Reducer threads, within a node, can be made to run at different times. These threads can be made to do different tasks by pulling different work out of work queues. This is not mandatory, i.e., you can have different groups of threads to perform different tasks, thus you might have reader, mapper, reducer and writer threads. A work queue for each reducer thread. Mapper threads will put work items into this queue. For load balance purposes it is desirable that the range of function H that determines which reducer will get a work item be from 0 to R where R = k⋅numMappers, and k is some constant. 
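The word-count flow described above (mappers that combine counts locally, a hash function H that picks one reducer queue per word, and reducers that sum the partial counts) can be sketched sequentially in Python. This is only an illustration under stated assumptions, not the assignment's OpenMP or MPI code: the function names are hypothetical, threads are replaced by plain loops, and `zlib.crc32` stands in for H, which the description leaves unspecified.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Assumed stand-in for H: maps a word to a reducer index in 0..R-1."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_with_combiner(words):
    """Mapper plus combiner: emit one <word, count> record per distinct word."""
    return Counter(words)


def shuffle(mapper_outputs, num_reducers):
    """Place every mapper record on the work queue of the reducer chosen by H."""
    queues = [[] for _ in range(num_reducers)]
    for combined in mapper_outputs:
        for word, count in combined.items():
            queues[partition(word, num_reducers)].append((word, count))
    return queues


def reduce_queue(queue):
    """Reducer: sum the partial counts that arrived for each word."""
    totals = Counter()
    for word, count in queue:
        totals[word] += count
    return dict(totals)


def word_count(chunks, num_reducers=4):
    """Run map, shuffle and reduce over a list of word lists (one per mapper)."""
    mapper_outputs = [map_with_combiner(chunk) for chunk in chunks]
    queues = shuffle(mapper_outputs, num_reducers)
    result = {}
    for queue in queues:
        # Each word lives in exactly one queue, so the dictionaries never collide.
        result.update(reduce_queue(queue))
    return result
```

Because every occurrence of a word hashes to the same index, exactly one reducer handles it, which matches the requirement that each word has a single Reducer thread.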
You need to have mechanisms to ensure that Mapper threads wait until all Readers have finished before considering themselves complete, i.e. the work queue from which Mapper threads get their work may be empty at some point in time, but have data at a later point in time because an unfinished Reader thread put data in it. Mappers will need to put their data on a reducer’s work queue based on the key (word) for that data: As mentioned above, the reducer of a key should be determined by some sort of hash function g = H(key). All keys that map onto reducer g should be added to g’s work queue. Each process can assume it will be receiving data from every other node. This will simplify the communication structure of your program when you go to the MPI version. A node that sends no data should send an “empty” record letting the other process know it will get no data from it. As each process finishes its reduce work, it should write its results to an output file, close it, and notify the master thread that it is finished so that it can terminate the job, and then terminate itself. MPI version: The MPI version will use multiple nodes. Each node will run a copy of the OpenMP code above to perform local computations. A few changes need to be made to the OpenMP process on a node to communicate with the OpenMP processes running on other nodes. Instead of mappers putting their results onto a reducer’s work queue, they should put them onto a list to be sent to other nodes. A sender thread should be used to send the results of reducers in these lists to the appropriate node. Each node should have a receiver thread that obtains data sent to it by sender threads in other nodes. The receiver thread for a node will place its received data onto work queues in the node for each reducer. Each node will read some portion of the FI > Nmpi*C input files.
We could statically define the files each node will process, but this could lead to some nodes getting many big files and other nodes getting many small files. Instead, each node should request a file from a master node which will either send a filename back to the node or an “all done” that indicates that all files have been or are being processed. Performance data and tuning: You should collect performance data showing: What the bottlenecks are in the code. This might involve the time Mapper threads are waiting for work from Reader threads, how long I/O takes vs. Mapping (not counting waiting for I/O on mapping) and data to support this and the other numbers below. How much load imbalance there is within a node. How much load imbalance there is across nodes (i.e. the difference in time between when the first map node is ready to send its data and when the last map node is ready to send its data to be reduced). You should experiment with different numbers of Reader threads. Step deliverables: For the OpenMP version: speedup numbers when using 1, 2, 4, . . . , #cores Mapper and Reader threads; For the MPI version: speedup numbers when using 1, 2, 4, …, #nodes to run the program, with Mapper and Reader threads for each core on a node (i.e. you don’t need to experiment with various numbers of nodes and cores); For the final turn-in version: A paper not longer than ten pages that describes your overall strategy, performance bottlenecks, performance numbers and implementation positives and negatives (what you are happy about, what you would like to change). A full set of performance numbers either the word-count problem, and scaling by number of nodes, and dataset size, for the matrix multiply problem. Speedups and efficiencies for 2, 4, 8 and 16 processors. Do the Karp-Flatt analysis on 2, 4, 8 and 16 processors. Curves showing the number of Reader threads and performance, and the number of map and reduce threads and performance.
Overall performance of the different parts of the map reduce, and the entire map reduce. For baseline “serial” numbers, use a system with one thread for each of the tasks above. Performance numbers for different numbers of nodes along with the various speedup metrics (speedup, efficiency and Karp-Flatt). An explanation of why you are getting the speedups you are getting. I may have a meeting with each group to have you demonstrate your code. This would likely happen during dead week. The point distribution will be 40% for a working parallel project with any speedup; 40% for the paper and presentation of your results and explanation of your results, 20% for acceptable speedups or non-trivial explanations of unacceptable speedups.
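The three speedup metrics requested above can be computed from measured run times as in the following minimal sketch. The function names and the timing numbers in the usage note are hypothetical; the formulas are the standard definitions, with speedup ψ = T1/Tp, efficiency ψ/p, and the Karp-Flatt experimentally determined serial fraction e = (1/ψ − 1/p) / (1 − 1/p).

```python
def speedup(t_serial, t_parallel):
    """Speedup psi = T1 / Tp."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Efficiency = speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / processors


def karp_flatt(t_serial, t_parallel, processors):
    """Karp-Flatt serial fraction e = (1/psi - 1/p) / (1 - 1/p), for p > 1."""
    if processors < 2:
        raise ValueError("Karp-Flatt is only defined for more than one processor")
    psi = speedup(t_serial, t_parallel)
    return (1.0 / psi - 1.0 / processors) / (1.0 - 1.0 / processors)
```

As a usage example with made-up timings, a run that takes 100 s serially and 40 s on 4 processors has a speedup of 2.5, an efficiency of 0.625, and a Karp-Flatt fraction of 0.2. A fraction that stays roughly constant as processors are added points to inherently serial work, while one that grows points to parallel overhead.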
skswldndi
Translation Dataset with Terminology for Kpop Fandom
lexibank
CLDF dataset accompanying Zariquiey et al.'s "Body Part Terminology in Pano" from 2022
This repository offers an evaluation of machine translation models for healthcare, focusing on languages like Telugu, Hindi, Arabic, and Swahili. It emphasizes accuracy and medical terminology, aiming to enhance medical communication across diverse languages. The dataset used in evaluation is provided.
Comprehensive documentation of the HCC-TACE-SEG dataset, including structure, terminology, scan details, and dataset explanations.
ssmithm
SAS program and XML map to import the VHA's Medication Reference Terminology into SAS datasets
caiyangcy
Reproduction of the EMNLP 2021 paper: Graphine: A Dataset for Graph-aware Terminology Definition Generation
EticaAI
[working-draft] Public domain datasets from the Translation Initiative for COVID-19 in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange)
Dreamer38324
Fin-HypoTermQA is a benchmark dataset for systematically evaluating whether large language models (LLMs) can recognize and appropriately reject fictitious financial terminology — a critical safety challenge when deploying LLMs in high-stakes financial domains.
Explored an image dataset and used a data augmentation technique to enhance the size of images. Further fed the data into many custom models and pre-trained models (VGG, Inception, RSNET) to gain accuracy, using techniques such as ensembling, transfer learning, and fine-tuning.
paulzuradzki
Source of truth for all terminology datasets in the Tuva Project.
Paveluz
Multilingual and multimodal dataset linking technical terms to annotated images for AI and educational use.
RonMallory
No description available
czgbjy
Geo-Terminology Relatedness Dataset
No description available
zhaojing956
Public dataset for bilingual terminology in advanced mathematics.
Fine-tuning Hugging Face Transformers models with financial terminology datasets
paulzuradzki
Code and terminology datasets for generating the CMS readmission measures.
OSU-slatelab
Interface for downloading datasets that require a UMLS Terminology Services (UTS) account.
Vijayagopikha
Assignment 1: Learn about new terminologies in ML, take one dataset as a table, and define the terminologies using examples from it.
paulzuradzki
Code and terminology datasets for generating chronic condition flags on your patient population.
Hemanth-Raja
Image Classification on CIFAR dataset, testing and explaining different functions and terminology in deep learning.
This repository contains the literature, datasets, and code I collected on the topic of terminology extraction